AgentClash is an AI agent evaluation platform designed for real-task races. It enables users to compare AI agents under identical conditions, scoring them live based on completion, speed, token efficiency, and tool strategy. The platform provides detailed step-by-step replays to understand performance differences.
Key Features:
Define challenges using broken code or build tasks.
Use models from OpenAI, Anthropic, Gemini, OpenRouter, and Mistral.
Run races with composite scoring metrics.
Access full replays for transparent performance analysis.
Implement failure-to-eval detection for comprehensive evaluation.
Audience & Benefit:
Ideal for AI researchers, developers, and teams focused on evaluating or benchmarking agents. AgentClash offers objective comparisons, identifies inefficiencies, tracks improvements over time, and supports CI/CD gates for automated testing.
The platform includes a CLI for managing workspaces, infrastructure, challenge packs, deployments, runs, and authentication directly from the terminal. The CLI can be installed via winget.
AgentClash
AI agent evaluation platform for real-task races. Compare agents with the same tools, same constraints, live scorecards, replay, and CI regression gates.
AgentClash puts AI agents on the same real task, at the same time, and scores them live on completion, speed, token efficiency, and tool strategy. Step-by-step replays show exactly why one agent won and another didn't.
Head-to-head races
Composite scoring
Full replays
Failure-to-eval flywheel
How it works
1. Define a challenge (broken code, a build task, etc.)
2. Drop in your models (OpenAI, Anthropic, Gemini, OpenRouter, Mistral)
3. Run the race with the same tools and the same constraints
4. See scored results with full step-by-step replays
Architecture
AgentClash is a monorepo with four main components:
| Component | Tech | Location |
| --- | --- | --- |
| API Server | Go / chi | backend/cmd/api-server |
| Worker | Go / Temporal SDK | backend/cmd/worker |
| CLI | Go / Cobra | cli/ |
| Web | Next.js 16 / React 19 | web/ |
Infrastructure dependencies:
| Service | Purpose |
| --- | --- |
| PostgreSQL 17 | Source of truth for all state |
| Temporal | Durable workflow orchestration for run execution |
| Redis (optional) | WebSocket fanout, rate limiting |
| E2B (optional) | Sandboxed code execution for native agent runs |
| S3-compatible storage (optional) | Artifact storage (filesystem fallback for dev) |
CLI
The agentclash CLI lets you manage everything from your terminal — runs, builds, deployments, comparisons, and infrastructure.
If you're only changing the CLI, you do not need to run the API server or worker locally. Point the local binary at a hosted API with AGENTCLASH_API_URL or --api-url.
export AGENTCLASH_API_URL="https://api.agentclash.dev"
cd cli
go run . auth login --device
go run . link
go run . run list
go run . eval start --help
# When the workspace already has challenge packs and deployments:
go run . eval start --follow
Resolution order is --api-url > AGENTCLASH_API_URL > saved user config > default. Source builds (go run ., make build) default to http://localhost:8080; released binaries default to https://api.agentclash.dev.
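The precedence described above can be sketched as a small shell function. This is only an illustration of the documented order, not the CLI's own code: `resolve_api_url` and its three arguments are hypothetical, and an empty string is assumed to mean "not set".

```shell
# Hypothetical helper that mirrors the documented precedence:
#   --api-url > AGENTCLASH_API_URL > saved user config > default
# Arguments: 1 = value of --api-url, 2 = saved user config value,
#            3 = build kind ("source" or "release").
resolve_api_url() {
  local flag_value="$1" saved_value="$2" build_kind="$3"
  if [ -n "$flag_value" ]; then
    printf '%s\n' "$flag_value"
  elif [ -n "${AGENTCLASH_API_URL:-}" ]; then
    printf '%s\n' "$AGENTCLASH_API_URL"
  elif [ -n "$saved_value" ]; then
    printf '%s\n' "$saved_value"
  elif [ "$build_kind" = "source" ]; then
    # Source builds (go run ., make build) default to the local API server.
    printf '%s\n' "http://localhost:8080"
  else
    # Released binaries default to the hosted API.
    printf '%s\n' "https://api.agentclash.dev"
  fi
}
```

For example, `resolve_api_url "" "" source` falls through every override and lands on the source-build default, provided AGENTCLASH_API_URL is unset.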
Quick start
agentclash auth login # Authenticate
agentclash link # Pick and save your default workspace
agentclash challenge-pack init support-eval.yaml
agentclash challenge-pack validate support-eval.yaml
agentclash challenge-pack publish support-eval.yaml
agentclash eval start --follow # Start an evaluation with guided selection
agentclash baseline set # Bookmark a baseline run
agentclash eval scorecard # View scorecard + regression verdict
If your workspace is already seeded with challenge packs and deployments, you can skip the authoring commands and start at agentclash eval start --follow.
CI/CD
AgentClash CI/CD gates the candidate agent revision against a repeatable workload. Start by committing an explicit manifest that maps repo changes to the candidate build, deployment settings, challenge pack, baseline, and release gate:
agentclash ci init .agentclash/ci.yaml
agentclash ci validate .agentclash/ci.yaml
export AGENTCLASH_WORKSPACE="your-workspace-id" # or pass -w
agentclash ci validate .agentclash/ci.yaml --remote --json
agentclash ci run --manifest .agentclash/ci.yaml --json --artifact-dir agentclash-artifacts
agentclash ci baseline --manifest .agentclash/ci.yaml --json
agentclash ci should-run --changed-file prompts/system.md --json
The current CLI validates that manifest locally by default, can optionally check referenced resource IDs against the selected workspace with --remote, runs the manifest workflow with ci run, and resolves the exact baseline run that the gate will compare against. In GitHub Actions, ci run attaches repository, pull request, branch, default branch, commit, workflow, and run URL metadata automatically and appends a Markdown gate summary when $GITHUB_STEP_SUMMARY is set. Use --artifact-dir to write uploadable result.json, run.json, scorecard.json, comparison.json, and gate.json files; use the --ci-* flags when another CI provider needs explicit source metadata. For pull request gates, prefer a locked baseline.run_id; update it only after a successful mainline run in a reviewed, auditable change.
Set regressions.promote_failures: proposed when failing CI gates should create reviewable regression candidates in the manifest's evaluation.regression_suites. The CLI deduplicates against existing non-archived/non-rejected cases and reports created, existing, skipped, blocked, and error outcomes in JSON and GitHub summaries. Use disabled to report only, or auto_on_main for protected default-branch workflows that may create active cases after refusing pull request and non-default-branch events.
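One way to read the three promote_failures modes is as a decision over the mode, the CI event, and the branch. The sketch below encodes that reading only. `promotion_outcome`, its arguments, and the outcome labels are hypothetical, and what the CLI actually does after a refusal is not specified here.

```shell
# Hypothetical decision helper; one reading of the three modes, not the CLI's code.
# Arguments: 1 = promote_failures mode, 2 = CI event name,
#            3 = current branch, 4 = default branch.
promotion_outcome() {
  case "$1" in
    disabled) echo "report-only" ;;        # report the failure, create nothing
    proposed) echo "create-proposed" ;;    # reviewable regression candidates
    auto_on_main)
      # Pull request events are refused regardless of branch.
      case "$2" in
        pull_request*) echo "refused"; return 0 ;;
      esac
      # Only the default branch may create active cases.
      if [ "$3" = "$4" ]; then
        echo "create-active"
      else
        echo "refused"
      fi
      ;;
    *) echo "unknown-mode"; return 1 ;;
  esac
}
```

Under this reading, `promotion_outcome auto_on_main push feature main` is refused because the branch is not the default branch.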
For GitHub Actions, use the reusable agentclash-ci action to install the CLI, validate the manifest, skip unrelated changes, run the gate, and expose artifact paths.
Run agentclash --help for the full command reference.
Test the CLI before release
Start with the fast local checks:
cd cli
go build ./...
go vet ./...
go test -short -race -count=1 ./...
go run github.com/goreleaser/goreleaser/v2@latest check
go run github.com/goreleaser/goreleaser/v2@latest release --snapshot --clean
cd ../web && pnpm build
cd ..
bash testing/cli-e2e-suite.sh --help
If you changed packaging or install behavior, rehearse the npm packages locally from the snapshot artifacts:
node scripts/publish-npm/assemble.mjs v0.0.0-rehearse cli/dist
for p in npm-out/platforms/*/ npm-out/cli; do
(cd "$p" && npm pack --dry-run)
done
For a real local install smoke test, pack the platform package for your host plus the root wrapper, then install both into a scratch directory:
(cd npm-out/platforms/ && npm pack --pack-destination /tmp)
(cd npm-out/cli && npm pack --pack-destination /tmp)
mkdir -p /tmp/agentclash-smoke && cd /tmp/agentclash-smoke
npm init -y
npm i /tmp/agentclash-cli--*.tgz /tmp/agentclash-*.tgz
./node_modules/.bin/agentclash version
Typical triples are darwin-arm64, darwin-x64, linux-arm64, linux-x64, win32-arm64, and win32-x64.
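If it helps to pick the right triple for your host, the mapping can be sketched from `uname` output. `triple_for` is a hypothetical helper, and the Windows patterns assume a Git Bash, MSYS, or Cygwin shell; it only covers the triples listed above.

```shell
# Hypothetical helper: maps `uname -s` / `uname -m` style values to one of
# the triples listed above. Returns non-zero for anything it does not know.
triple_for() {
  local os arch
  case "$1" in
    Darwin) os=darwin ;;
    Linux) os=linux ;;
    MINGW*|MSYS*|CYGWIN*) os=win32 ;;
    *) return 1 ;;
  esac
  case "$2" in
    arm64|aarch64) arch=arm64 ;;
    x86_64|amd64) arch=x64 ;;
    *) return 1 ;;
  esac
  printf '%s-%s\n' "$os" "$arch"
}

# Usage: triple_for "$(uname -s)" "$(uname -m)"
```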
Release the CLI to npm
Routine CLI releases should go through Release Please rather than manual npm publish.
1. Make a releasable CLI change under cli/ and validate it locally.
2. Use a conventional commit that matches the desired version bump: fix: for patch, feat: for minor, feat!: for major.
3. Merge to main.
4. Release Please opens chore(main): release x.y.z when releasable fix:, feat:, or feat!: commits have touched cli/.
5. Merge that release PR.
6. The tag-triggered .github/workflows/release-cli.yml workflow builds GitHub release assets, publishes npm, and runs smoke installs on Ubuntu, macOS, and Windows.
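The commit-prefix rule in the steps above can be checked before pushing with a small sketch like this one. `bump_for_commit` is a hypothetical helper that covers only the three prefixes named above; accepting a scope such as feat(cli): is an assumption based on the Conventional Commits format, and Release Please's own parsing may recognize more forms.

```shell
# Hypothetical helper: maps a commit subject to the version bump described above.
# Prints "none" for subjects that would not trigger a release.
bump_for_commit() {
  case "$1" in
    "feat!:"*|"feat("*")!:"*) echo major ;;
    "feat:"*|"feat("*"):"*)   echo minor ;;
    "fix:"*|"fix("*"):"*)     echo patch ;;
    *)                        echo none ;;
  esac
}

# Usage: bump_for_commit "$(git log -1 --pretty=%s)"
```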
The one-time npm Trusted Publishing bootstrap is already documented in CLI Distribution. Normal day-to-day releases should not need manual npm website work.
Node.js 20+ and pnpm — for the web frontend (optional)
psql — PostgreSQL client for running migrations
1. Start everything (one command)
The quickest way to get the full stack running locally:
./scripts/dev/start-local-stack.sh
This starts PostgreSQL, applies migrations, launches the Temporal dev server, API server, and worker. Logs are written to /tmp/agentclash-local-stack/.
2. Start services individually
If you prefer more control, start each component separately:
Database
# Start PostgreSQL (Docker)
make db-up
# Apply schema migrations
make db-migrate
# (Optional) Seed development data
make db-seed
The default connection string is postgres://agentclash:agentclash@localhost:5432/agentclash?sslmode=disable. Override it with the DATABASE_URL environment variable.
Temporal
Start the Temporal dev server on the default port:
temporal server start-dev --namespace default
The API server and worker connect to localhost:7233 by default. Override with TEMPORAL_HOST_PORT.
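To see which values a local session would end up with, the two overrides can be sketched with shell parameter expansion. The helper names are hypothetical, and treating an empty variable the same as an unset one is an assumption about how the servers read their environment.

```shell
# Defaults copied from the text above; DATABASE_URL and TEMPORAL_HOST_PORT
# are the documented overrides. The helper names are hypothetical.
DEFAULT_DATABASE_URL="postgres://agentclash:agentclash@localhost:5432/agentclash?sslmode=disable"
DEFAULT_TEMPORAL_HOST_PORT="localhost:7233"

effective_database_url() {
  printf '%s\n' "${DATABASE_URL:-$DEFAULT_DATABASE_URL}"
}

effective_temporal_host_port() {
  printf '%s\n' "${TEMPORAL_HOST_PORT:-$DEFAULT_TEMPORAL_HOST_PORT}"
}

echo "database: $(effective_database_url)"
echo "temporal: $(effective_temporal_host_port)"
```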
API Server
make api-server
The server starts on :8080. Verify with:
curl http://localhost:8080/healthz
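When scripting against a freshly started server, a single curl may run before the port is open. A minimal polling sketch follows; `wait_for_api` and its attempt and delay defaults are hypothetical, and it assumes a successful HTTP status from /healthz means the server is ready.

```shell
# Hypothetical helper: polls the health endpoint until it answers with a
# successful status or the attempts run out.
# Arguments: 1 = URL, 2 = max attempts, 3 = delay in seconds between attempts.
wait_for_api() {
  local url="${1:-http://localhost:8080/healthz}"
  local attempts="${2:-30}"
  local delay="${3:-1}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors as failures; -sS: quiet, but still print errors.
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage: wait_for_api "http://localhost:8080/healthz" 30 1 && echo "API server is up"
```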
Worker
make worker
The worker connects to both PostgreSQL and Temporal to execute run workflows.
AgentClash is released under FSL-1.1-MIT, the Functional Source License with an MIT Future License clause. See LICENSE for the full text.
The short version:
You can use, modify, fork, self-host, and embed AgentClash for essentially any purpose (internal use, commercial product development, consulting, research, education), with one exception:
You can't offer AgentClash (or something "substantially similar") as a commercial product or service that competes with agentclash.dev.
Every released version auto-converts to MIT on its second anniversary, so anything released 2+ years ago is fully permissive open source.
If you want to do something this license doesn't obviously cover, email us before you build.