AgentClash is an AI agent evaluation platform designed for real-task races. It enables users to compare AI agents under identical conditions, scoring them live based on completion, speed, token efficiency, and tool strategy. The platform provides detailed step-by-step replays to understand performance differences.
Key Features:
Define challenges using broken code or build tasks.
Use models from providers including OpenAI, Anthropic, Gemini, OpenRouter, and Mistral.
Run races with composite scoring metrics.
Access full replays for transparent performance analysis.
Turn observed failures into evals so the same mistake is checked before future releases.
Audience & Benefit:
Ideal for AI researchers, developers, and teams focused on evaluating or benchmarking agents. AgentClash offers objective comparisons, identifies inefficiencies, tracks improvements over time, and supports CI/CD gates for automated testing.
The platform includes a CLI for managing workspaces, infrastructure, challenge packs, deployments, runs, and authentication directly from the terminal. The CLI can be installed via winget.
README
AgentClash
Open-source AI-agent evaluation for real tasks. AgentClash helps teams find where agents break, replay the evidence, score the outcome, and turn failures into regression gates before release.
AgentClash is built for teams shipping agents, not leaderboard demos. It runs agents against the same workload with the same tools and constraints, then preserves the transcript, artifacts, replay, scorecard, and failure taxonomy that explain why an agent passed or failed.
npm i -g agentclash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash auth login --device
agentclash link
agentclash doctor
Released npm binaries default to the hosted API. Keep the AGENTCLASH_API_URL export when you want to be explicit or switch between hosted and self-hosted environments.
If the workspace already has challenge packs and deployments, start an eval:
For a specific completed run, use the run-first scorecard command:
agentclash eval scorecard --agent
agentclash run scorecard is lower-level and expects a run-agent ID. Use agentclash run agents when you need that ID directly.
What You Can Evaluate
Challenge packs package prompts, tools, sandboxes, input sets, validators, judges, expected artifacts, and scoring rules. Start with the challenge pack reference.
Replay and scorecards preserve the full trajectory: model calls, tool calls, sandbox commands, artifacts, verdicts, latency, cost, and failure evidence. See interpreting results.
Regression suites promote escaped failures into permanent checks so the same mistake is tested before future releases.
Datasets import or curate pinned examples, run real agent evals, record baselines, sync regression suites, and gate CI. See datasets overview.
Multi-turn packs support scripted, LLM-driven, and human user simulators with takeover commands for operator input. See multi-turn packs.
Security evals test prompt injection, secret hygiene, and sandbox or vault boundaries without copying real secrets into docs. See security evaluation.
Agent harnesses run external coding agents such as Claude Code, Codex, OpenClaw, and Hermes as first-class eval candidates in sandboxes.
CI And Release Gates
AgentClash can compare a candidate run against a baseline and fail CI when the candidate regresses.
agentclash ci init .agentclash/ci.yaml
agentclash ci validate .agentclash/ci.yaml --remote
agentclash ci run \
  --manifest .agentclash/ci.yaml \
  --json \
  --artifact-dir agentclash-artifacts
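To make the idea of a baseline-versus-candidate gate concrete, here is a minimal sketch in POSIX shell. It is not how AgentClash computes regressions: the function name regression_gate, the single numeric score, and the tolerance argument are all assumptions made for illustration. It only shows the general pattern of failing a CI step when a candidate scores below a baseline.

```shell
#!/bin/sh
# Hypothetical gate, NOT the AgentClash implementation.
# Usage: regression_gate BASELINE CANDIDATE [TOLERANCE]
# Succeeds (status 0) when CANDIDATE + TOLERANCE is at least BASELINE,
# and fails (status 1) when the candidate has regressed beyond the tolerance.
regression_gate() {
    baseline=$1
    candidate=$2
    tolerance=${3:-0}
    # awk handles the floating-point comparison that plain sh cannot.
    awk -v b="$baseline" -v c="$candidate" -v t="$tolerance" \
        'BEGIN { exit (c + t < b) ? 1 : 0 }'
}
```

A CI step could call it as `regression_gate 0.82 0.70 || exit 1`, which stops the job because the candidate score is below the baseline.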
Use the bundled GitHub Action when you want PR comments and uploaded artifacts:
AGENTCLASH_TOKEN is the automation token used by CI. AGENTCLASH_WORKSPACE is the workspace ID that should own the run and artifacts. For local CLI sessions, agentclash link can save the workspace; CI should pass both values explicitly through repository or organization secrets.
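Because CI has no saved workspace to fall back on, a job may want to fail fast when either value is absent. The following is a minimal preflight sketch; the function name require_ci_env is hypothetical and is not part of the agentclash CLI. It only checks that the two variables named above are non-empty.

```shell
#!/bin/sh
# Hypothetical preflight check, not an agentclash command.
# Returns 0 when AGENTCLASH_TOKEN and AGENTCLASH_WORKSPACE are both non-empty.
# Otherwise prints each missing name to stderr and returns 1.
require_ci_env() {
    missing=0
    for name in AGENTCLASH_TOKEN AGENTCLASH_WORKSPACE; do
        # Indirect lookup of the variable whose name is stored in $name.
        # The names come from the fixed list above, so eval is safe here.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `require_ci_env || exit 1` before the CI commands surfaces a missing secret as a clear error instead of a later authentication failure.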
API URL resolution order is:
--api-url > AGENTCLASH_API_URL > saved user config > default
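The precedence chain above can be read as a first-non-empty lookup. The sketch below illustrates that reading in POSIX shell. The function resolve_api_url and its positional arguments are hypothetical, and treating an empty value as "not set" is an assumption; the CLI's actual resolution logic may differ in detail.

```shell
#!/bin/sh
# Hypothetical illustration of the documented order:
#   --api-url > AGENTCLASH_API_URL > saved user config > default
# Arguments:
#   $1  value passed via --api-url (empty if the flag was not given)
#   $2  value from saved user config (empty if none was saved)
#   $3  built-in default
# Assumption: an empty string counts as "not set" at every level.
resolve_api_url() {
    flag_value=$1
    saved_value=$2
    default_value=$3
    if [ -n "$flag_value" ]; then
        printf '%s\n' "$flag_value"
    elif [ -n "${AGENTCLASH_API_URL:-}" ]; then
        printf '%s\n' "$AGENTCLASH_API_URL"
    elif [ -n "$saved_value" ]; then
        printf '%s\n' "$saved_value"
    else
        printf '%s\n' "$default_value"
    fi
}
```

Under this reading, an exported AGENTCLASH_API_URL overrides whatever was saved in user config, and an explicit --api-url overrides both.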
AgentClash ships Agent Skills that teach coding agents how to use the CLI, read scorecards, author packs, and gate releases.
Install first-class integration skills with the CLI:
agentclash integration claude install
agentclash integration codex install
agentclash integration cursor install
agentclash integration claude doctor
Supported CLI integration hosts are claude, codex, cursor, openclaw, hermes, and opencode. GitHub CLI skill bundles for additional hosts are documented in Use with AI tools.
Local Development
AgentClash is a monorepo:
backend/ - Go API server and Temporal worker.
cli/ - Go CLI module published through the agentclash npm package.
web/ - Next.js marketing, app, and docs site.
Run CLI checks from cli/:
cd cli
go build ./...
go vet ./...
go test -short -race -count=1 ./...