ARIS-Code (Auto Research in Sleep) is a terminal-based AI research assistant built for academic researchers. Its core philosophy:
- Executor: the primary LLM, which writes code, surveys literature, drafts papers, and plans experiments
- Reviewer: an independent LLM that adversarially critiques the Executor's output via the LlmReview tool
- Iterate: Executor writes → Reviewer critiques → Executor revises → loop until quality converges
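
The Executor/Reviewer loop above can be sketched in a few lines. This is only an illustration: `review_loop`, the callables it takes, the numeric score, and the `target_score` / `max_rounds` defaults are all hypothetical and are not part of the ARIS API.

```python
from typing import Callable, Tuple


def review_loop(
    execute: Callable[[str, str], str],
    review: Callable[[str], Tuple[float, str]],
    task: str,
    target_score: float = 8.0,   # hypothetical convergence threshold
    max_rounds: int = 4,         # hypothetical round budget
) -> Tuple[str, float, int]:
    """Alternate executor and reviewer until the score reaches the target."""
    feedback = ""
    draft, score = "", 0.0
    for round_no in range(1, max_rounds + 1):
        draft = execute(task, feedback)   # Executor writes or revises
        score, feedback = review(draft)   # Reviewer critiques
        if score >= target_score:
            return draft, score, round_no
    return draft, score, max_rounds


# Stub models, for illustration only.
def fake_execute(task: str, feedback: str) -> str:
    return task + (" [revised]" if feedback else "")


def fake_review(draft: str) -> Tuple[float, str]:
    return (9.0, "ok") if "[revised]" in draft else (5.0, "needs work")


result = review_loop(fake_execute, fake_review, "draft intro")
```

With the stubs above, the first draft is rejected and the revised draft passes on the second round.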
With 42 bundled research skills, ARIS covers the full pipeline from idea discovery to paper submission.
๐ก Use ARIS as a skill-based workflow in Claude Code / Codex CLI / Cursor / Trae / / / , or get the full experience with the standalone CLI โ enjoy any way you like!
๐ฑ ARIS is a methodology, not a platform. What matters is the research workflow โ take it wherever you go.
๐ค AI agents: Read AGENT_GUIDE.md instead โ structured for LLM consumption, not human browsing.
๐ก๏ธ ARIS audits its own output โ now Anti-Autoresearch audits everyone's. It catalogs 39 autoresearch hack-patterns across 7 families and checks a submission for them end-to-end, producing a deterministic, reviewer-ready integrity report. Self-consistency + fabrication forensics, not an AI-text detector.
The field has put up with unreliable autoresearch long enough โAnti-Autoresearch is the read that finally catches it.
๐ฌ ARIS goes multimodal โ ARIS-Movie-Director โ hand over a fuzzy story, wake up to a cross-model-audited movie (reference run = 19 scenes). Long-horizon visual stories drift two ways (๐ง long-range forgetting ยท ๐ฃ๏ธ each frame signed off by the model that drew it); ARIS answers with the same DNA โ a research-wiki for memory + multi-agent debate so no frame signs off on itself.
> ๐งญ Not just movies โ the same audited spiral also generates clean method / flow diagrams: this very figure was baked by ARIS-Movie-Director's image_gen + cross-model panel_gate loop. ๐ Skills + an end-to-end CLI in ARIS-Movie-Director: /movie-pipeline (agent workflow + standalone deterministic CLI core) and /method-figure, the skill that made this figure.
๐๏ธ A few frames from the reference movie โ the story's own integrity beat: a run that reported +6.2 but really moved +1.4. ย โถ watch all 19 scenes โ
๐ฐ ARIS-Code v0.4.21 (2026-06) โ latest is a bug-fix patch (5 user-facing fixes from a Codex hunt: streamed CJK/emoji text no longer corrupts to ๏ฟฝ on OpenAI-compatible providers, a saved executor config no longer overrides a shell EXECUTOR_PROVIDER, truncated streams no longer save as complete, cross-line grep_search, MCP structuredContent). Headline features: v0.4.18 โ default model Claude Opus 4.8 (corrected pricing + availability fallback) and v0.4.17 โ the MCP release (mcpServers drive real tool dispatch; cross-model review needs no OpenAI API key โ aris setup wires your ChatGPT subscription in as reviewer via Codex MCP). Caps a 17-release run (v0.4.5 โ v0.4.21); per-release detail below. Credits: @GetIT-Sunday, @Anduin9527, @GO-player-hhy, @Jxy-yxJ, @screw-44, @StevenUST, @opposj, @ShijunLei-cn, @algojogacor.
> Per-release details (v0.4.5 → v0.4.21)
>
> v0.4.21 (2026-06-28) โ bug-fix patch: 5 new user-facing bugs from a Codex adversarial hunt (all disk-verified, distinct from v0.4.20), each cross-model reviewed at a design gate and an implementation gate (gpt-5.5 xhigh; both started NO-GO โ the reviewer caught an off-by-one in the grep line-mapping and a missing stream-level test before GO). ๐ Headline: OpenAI-compatible streaming corrupted multi-byte UTF-8 (CJK / emoji) split across network chunks into ๏ฟฝ โ each HTTP body chunk was from_utf8_lossy'd independently, so a 3-byte Chinese character or 4-byte emoji straddling a chunk boundary broke on both sides (a frequent hit for Chinese users on domestic OpenAI-compatible providers โ Kimi/GLM/MiniMax/DeepSeek/Qwen/Doubao โ streaming Chinese text); the stream buffer is now raw bytes, decoding only complete SSE lines. A saved OpenAI/custom executor config no longer overrides a shell-set EXECUTOR_PROVIDER โ the startup "shell-provided vars win" path had one ungated write that re-pointed EXECUTOR_PROVIDER=anthropic โฆ aris โฆ to OpenAI (wrong executor / model-not-found). An Anthropic stream truncated after content but before a terminal signal now hard-errors (premature_eof) instead of saving a half-finished answer to history as a complete turn (symmetric to the OpenAI #249 guard; the stop_reason-only compat path is preserved, and ARIS_ALLOW_EOF_WITHOUT_STOP=1 opts a terminal-signal-less proxy back into the old behavior). grep_search with multiline: true now matches across lines in content mode (was silently empty โ count mode already worked). MCP tool results carried only in structuredContent (empty content) are no longer dropped โ the model gets the JSON structured payload. Tests (CI mode): api 32โ35 / runtime 205โ212 / tools 67 / aris-cli 172โ181 / commands 5 (+21, incl. 2 stream-level integration tests), all green. 
Codex MCP (gpt-5.5 xhigh): design gate (NO-GO โ GO after fixing the off-by-one) โ implementation gate (NO-GO โ GO after adding the stream-level integration tests); the Anthropic streaming spec (every stream ends with message_stop) was WebFetch-verified. Two latent-only candidates (Anthropic block-index routing, OpenAI multi-line SSE) remain deferred.
>
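
The v0.4.21 headline bug is easy to reproduce outside ARIS. The sketch below is a Python analogue, not the actual Rust code: `decode_per_chunk` mimics decoding each network chunk lossily on its own, and `decode_complete_lines` buffers raw bytes and decodes only complete newline-terminated lines. Both function names are made up for this illustration.

```python
from typing import Iterable, List


def decode_per_chunk(chunks: Iterable[bytes]) -> str:
    """Buggy shape: every chunk is decoded independently and lossily."""
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)


def decode_complete_lines(chunks: Iterable[bytes]) -> List[str]:
    """Fixed shape: keep raw bytes, decode only once a full line is buffered."""
    buf = bytearray()
    lines: List[str] = []
    for chunk in chunks:
        buf.extend(chunk)
        while (nl := buf.find(b"\n")) != -1:
            line = bytes(buf[:nl])
            del buf[: nl + 1]          # drop the line and its newline
            lines.append(line.decode("utf-8", errors="replace"))
    return lines


payload = "data: 你好 🙂\n".encode("utf-8")
# Split inside the 3-byte character that follows "data: " (bytes 6..8).
chunks = [payload[:8], payload[8:]]

broken = decode_per_chunk(chunks)
fixed = decode_complete_lines(chunks)
```

Here `broken` contains U+FFFD replacement characters on both sides of the split, while `fixed` recovers the original line intact.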
> v0.4.20 (2026-06-19): bug-fix patch. 7 user-facing bugs surfaced by a Codex adversarial hunt, each reviewed across 3 rounds (the reviewer caught a redraw gap, a trailing-blank, a spinner tail, and a blank-line edge before GO). Headline (#299): short REPL replies showed only "Done". The spinner draws "Thinking…" with Save/RestorePosition so streamed output overwrites it on the same line, but finish then cleared that whole line, erasing a short single-line reply. The REPL now finishes without clearing when the turn printed visible text (Clear(UntilNewLine) wipes only the spinner tail after the reply). Streamed multi-paragraph replies rendered glued ("para1para2") because each chunk's paragraph separator was trimmed at the stream boundary; the markdown streamer now preserves separators via a held-separator so streamed output equals a single full render (no dangling blank line). Markdown tables with CJK/fullwidth content misaligned; width now counts display cells (CJK = 2), not chars. aris "prompt" / --print ignored the executor model saved by aris setup (REPL-only before), so a configured OpenAI/custom executor got the Anthropic default sent to its endpoint; the one-shot and REPL paths now share one resolver. Esc now actually closes the completion dropdown (it was recomputed right back). glob_search reports the total matched count when truncated (not the capped 100, which made the model think a 1000-match glob had 100 files). /model's custom menu reads the effective env the executor uses, not stale on-disk config. Tests (CI mode): api 32 / runtime 205 / tools 67 / aris-cli 172 / commands 5, all green; +7 new; real-machine verified (short reply renders + "Done"; paragraphs keep their blank lines). Codex MCP (gpt-5.5 xhigh): hunt → 3 review rounds (NO-GO → NO-GO → GO). Two latent-only candidates (Anthropic block-index routing, OpenAI multi-line SSE) deferred to a hardening pass.
>
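
The table-alignment fix in v0.4.20 (count display cells, not characters) can be illustrated with Python's standard library. This is a sketch of the general technique, not the ARIS renderer; `display_width` and `pad_cell` are hypothetical helper names.

```python
import unicodedata


def display_width(text: str) -> int:
    """Terminal cells used by text: wide/fullwidth characters count as 2."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no cell of their own
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def pad_cell(text: str, width: int) -> str:
    """Right-pad a table cell to a target display width."""
    return text + " " * max(0, width - display_width(text))


rows = [("model", "模型"), ("status", "完成")]
col_width = max(display_width(cell) for row in rows for cell in row)
table = ["| " + " | ".join(pad_cell(c, col_width) for c in row) + " |" for row in rows]
```

Padding by `len()` instead would give the CJK cells too many spaces, which is the misalignment described above.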
> v0.4.19 (2026-06-14): honesty / guardrails patch (theme from a Codex fresh-eyes audit; no behavior change for healthy setups). MCP protocol-version negotiation guard: the stdio handshake requested 2025-03-26 but never read the version the server negotiated back (a parsed-but-dead field), so a server agreeing on a version ARIS can't speak was silently accepted and later tools/list / tools/call ran on an incompatible protocol with opaque failures. ARIS now validates the negotiated version against a supported set (2025-11-25 / 2025-06-18 / 2025-03-26 / 2024-11-05; stdio framing is identical across these) and, on an unsupported one, terminates the child + clears the slot + surfaces a clear per-server error (aris doctor shows it). This is the "terminate when versions can't be agreed" behavior the MCP lifecycle spec requires. The request stays 2025-03-26 (proven against codex mcp-server), so healthy servers are unaffected; verified end-to-end: the real Codex MCP server still spawns + initializes + advertises its tools under the guard. Papercuts: the OpenAI-family subagent fail-loud message dropped its stale "lands in v0.4.18" marker (now version-agnostic + actionable, still credential-free); OpenAI upstream error bodies are now truncated (500 chars) + credential-redacted (sk-… keys and Bearer … tokens, via a substring scanner that catches the compact-JSON shape {"api_key":"sk-…"} a misconfigured proxy can reflect back) instead of splatted verbatim; the system-prompt hook-events summary now counts only the hooks the runtime actually executes (a command hook with a command string), matching the parser. Tests (CI mode): api 32 / runtime 204 / tools 67 / aris-cli 167 / commands 5, all green; live smoke confirms the real Codex MCP server still initializes under the guard. Reviewed by Codex MCP (gpt-5.5 xhigh): design GO → impl NO-GO (compact-secret miss + command-string strictness) → GO after fixes.
>
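
A minimal sketch of the v0.4.19 version guard, assuming the server's `initialize` result is available as a dict with a `protocolVersion` field (as in the MCP specification). The function and exception names are hypothetical; the supported set is the one listed above.

```python
SUPPORTED_VERSIONS = ("2025-11-25", "2025-06-18", "2025-03-26", "2024-11-05")
REQUESTED_VERSION = "2025-03-26"  # what the client asks for in initialize


class UnsupportedProtocolVersion(Exception):
    """Raised when the server negotiates a version the client cannot speak."""


def check_negotiated_version(initialize_result: dict) -> str:
    """Return the negotiated version, or raise so the caller can terminate the server."""
    version = initialize_result.get("protocolVersion")
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedProtocolVersion(
            f"server negotiated {version!r}; supported: {', '.join(SUPPORTED_VERSIONS)}"
        )
    return version


ok = check_negotiated_version({"protocolVersion": "2024-11-05"})
```

In this sketch the caller is expected to kill the child process and report the error per server when the exception is raised; that cleanup is not shown.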
> v0.4.18 (2026-06-14): default model → Claude Opus 4.8, with corrected pricing and a safety net. The bump moves DEFAULT_MODEL, the opus alias, the model picker, aris setup, and the subagent default to claude-opus-4-8, with an availability fallback: if the account lacks 4.8 (the API returns 404 not_found_error), ARIS auto-falls-back to claude-opus-4-7 for the session, rebuilds the system-prompt model identity so it stays coherent (the model is never told it's 4.8 while served 4.7), warns once, and retries, for the main session (text + JSON) and subagents. It fires only on that precise 404 (never 400/rate-limit/auth), latches against loops, and the text path rebuilds from a pre-turn snapshot so a retry never double-appends the user message; accounts with 4.8 are byte-identical to a plain bump. Pricing corrected (verified against Anthropic's published schedule; had been a 3–5× over-estimate): current Opus 4.5–4.8 = $5/$25 (deprecated Opus 4/4.1 keep $15/$75, split by word-boundary so a future opus-4-10 isn't mis-tiered); Sonnet 4.x = $3/$15 (decoupled from the generic unknown-model fallback, which stays $15/$75); Haiku was already correct. Backlog: aris setup option 10 pins the Codex MCP reviewer to model_reasoning_effort="xhigh" (deterministic for new setups, independent of ~/.codex/config.toml; idempotent merge never clobbers an existing entry); a startup + aris doctor misconfig hint (#259) for a silently-ignored/misplaced config (malformed JSON, or a stray ~/.aris/config.yaml; stderr-only so --print/JSON stdout stays clean); the system-prompt hook summary now marks parsed-but-never-fired events "PARSED ONLY … will NOT run" instead of implying dead hooks run (full event expansion, i.e. actually firing SessionStart/SessionEnd/…, is deferred to a separate issue). Tests (CI mode): api 32 / runtime 202 / tools 67 / aris-cli 166 / commands 5, all green; a live one-shot smoke returns model=claude-opus-4-8 end-to-end. Reviewed by Codex MCP (gpt-5.5 xhigh) across the design + both implementation batches (design REWORK→GO, impl NO-GO→GO, batch-2 GO).
>
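
The availability fallback in v0.4.18 comes down to a narrow trigger plus a latch. The sketch below shows only that decision; the class name and method are hypothetical, and the system-prompt rebuild, warning, and retry themselves are left out.

```python
PRIMARY_MODEL = "claude-opus-4-8"
FALLBACK_MODEL = "claude-opus-4-7"


class SessionModel:
    """Tracks the active model and falls back at most once per session."""

    def __init__(self) -> None:
        self.model = PRIMARY_MODEL
        self.fell_back = False  # latch: prevents fallback loops

    def on_error(self, status: int, error_type: str) -> bool:
        """Return True if the caller should retry on the fallback model."""
        if self.fell_back:
            return False
        # Fire only on the precise "model not found" signal,
        # never on 400, rate-limit, or auth errors.
        if status == 404 and error_type == "not_found_error":
            self.model = FALLBACK_MODEL
            self.fell_back = True
            return True
        return False


session = SessionModel()
retry_on_rate_limit = session.on_error(429, "rate_limit_error")
retry_on_missing_model = session.on_error(404, "not_found_error")
retry_again = session.on_error(404, "not_found_error")
```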
> v0.4.17 (2026-06-13): the MCP release. Before v0.4.17, mcpServers in settings.json parsed, showed in aris doctor, and did nothing; now ARIS spawns each configured stdio server, runs the MCP handshake, discovers tools, and advertises them as mcp____ on both provider paths (Anthropic + OpenAI-family), with end-to-end dispatch, soft per-server degradation, and an approval gate for untrusted MCP tools (external processes the sandbox can't cover; --allowedTools now accepts mcp__ names). Zero-API-key cross-model reviewer: aris setup → option 10 (Codex MCP, ChatGPT subscription, no API key) writes an idempotent mcpServers.codex entry into the settings file the runtime actually reads (atomic write + backup, explicit consent before trust: true), with an optional API reviewer as fallback (ARIS_REVIEWER_PROVIDER=codex-mcp + ARIS_REVIEWER_FALLBACK_PROVIDER); /setup in-REPL rebuilds the system prompt + runtime so reviewer changes take effect without quitting. Protocol fix the fakes couldn't catch: real-machine e2e against codex mcp-server exposed that our stdio transport spoke LSP-style Content-Length: framing while the MCP spec (and codex) use newline-delimited JSON-RPC. Every fake-server test passed because the fakes spoke the same wrong dialect; writes are now NDJSON, reads auto-detect both, and the spec-mandated notifications/initialized is sent after initialize (a select-based round-trip also closes the #286 large-request deadlock). Hooks: object-style Claude Code hooks preserve matcher / timeout / async (anchored-regex matcher filtering; per-hook timeout, default 30 s kill = warning not deny). Long tail: ARIS_DISABLE_KEYCHAIN gate, Anthropic stop_reason clean-EOF symmetry (CL2), OpenAI tool-call index-missing merge-by-id (OE6), slash commands enter history. Real-machine push-gate hardening (the zero-key reviewer's first run): codex codex/event notification spam silenced by default (gated behind ARIS_MCP_STDERR=inherit), the system prompt now tells the model not to pass a model parameter to Codex (account default = gpt-5.5 + xhigh; arbitrary names are rejected by a ChatGPT account), and a Codex-MCP-primary-with-no-fallback LlmReview call returns a clear "use mcp__codex__codex" message instead of a misleading OPENAI_API_KEY/gpt-5.5 error. Built with the v0.4.16 zero-regression methodology (24 new characterization tests; every deliberate flip annotated in-place). Tests (CI mode): runtime 199 / aris-cli 165 / tools 67 / api 30 / commands 5, all green. Reviewed phase-by-phase by Codex MCP (gpt-5.5 xhigh) across 17 rounds (7 NO-GOs all resolved). Subagent MCP routing (P8) + MCP protocolVersion bump + hook async execution deferred to v0.4.18.
>
> v0.4.16 (2026-05-30): REPL UX + provider hardening, shipped under a zero-regression discipline: 64 characterization ("golden") tests locked the current provider-routing / pricing / reviewer / subagent / REPL behavior first, then stayed green through every change. Closes #274: command history now persists to ~/.config/aris/history (0600) and reloads on startup, with an ARIS_NO_HISTORY kill-switch and a disk-only secret-skip (credential-looking lines stay in in-session history but never touch disk); bash-style Ctrl+R reverse incremental search (CJK display-width-aware single-line render; no existing key binding changed; no new dependency). Security: an OpenAI-family main session (Kimi / GLM / MiniMax / …) spawning a subagent previously silently billed the user's Anthropic OAuth/Keychain credential; it now fails loud with a clear, credential-free error. Full OpenAI-family subagent routing is a cross-crate change deferred to v0.4.17. Groundwork (no behavior change): the 3 byte-identical word-boundary matchers consolidate into one canonical runtime::word_match; new pure ProviderFamily classifier (unwired). Tests (CI mode): runtime 164 / aris-cli 128 / tools 49 / commands 5, all green. Codex MCP (gpt-5.5 xhigh) reviewed each phase + a final integration pass.
>
> v0.4.15 (2026-05-29): OpenAI-compatible streaming robustness hotfix. Closes #249: MiniMax (and other OpenAI-compatible providers / proxies) were effectively unusable because the clean-EOF completion check treated the data: [DONE] SSE sentinel as the only authoritative signal. A non-empty choices[].finish_reason is the Chat Completions spec's terminal-chunk marker; [DONE] is a transport convention some compatible providers never emit (MiniMax sends finish_reason: "stop" then closes without [DONE]). The clean-EOF decision is now a pure, unit-tested stream_eof_action(...) that completes on EITHER [DONE] OR a non-empty finish_reason; reads are NOT stopped early at finish_reason (a trailing include_usage usage-only chunk is still consumed), genuine truncation still hard-errors, and a pre-output proxy abort still restarts. Coupled fixes: OE7 reads finish_reason before the delta guard (delta-less terminal choice); OE2 flushes pending tool calls on any non-empty finish_reason; OE4 surfaces a mid-stream error envelope as a hard error instead of silently dropping it; OE3 tolerates data:{...} without the space after the colon. +5 unit tests (77→82) extract the previously-untested SSE completion logic into pure helpers. Anthropic SSE path untouched. Codex MCP (gpt-5.5 xhigh) 3 rounds (GO-WITH-NITS → GO-WITH-NITS → GO).
>
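
The clean-EOF decision described for v0.4.15 can be written as a small pure function. The name `stream_eof_action` comes from the release note, but the signature, the `EofAction` enum, and the parameter names below are assumptions made for this sketch.

```python
from enum import Enum
from typing import Optional


class EofAction(Enum):
    COMPLETE = "complete"   # stream ended cleanly
    RESTART = "restart"     # nothing emitted yet: safe to retry the whole stream
    ERROR = "error"         # output was emitted, then the stream was cut off


def stream_eof_action(
    saw_done: bool,
    finish_reason: Optional[str],
    emitted_output: bool,
) -> EofAction:
    """Decide what to do when the HTTP body ends."""
    # Either terminal signal is enough: the [DONE] sentinel or a
    # non-empty finish_reason on the last choice.
    if saw_done or finish_reason:
        return EofAction.COMPLETE
    if not emitted_output:
        return EofAction.RESTART
    return EofAction.ERROR


# A provider that sends finish_reason "stop" and closes without [DONE]:
minimax_style = stream_eof_action(saw_done=False, finish_reason="stop", emitted_output=True)
```

Keeping the decision pure (no I/O) is what makes it straightforward to unit-test as a truth table.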
> v0.4.14 (2026-05-25): Security-hygiene release closing the top items from the v0.4.13 codex audit (gpt-5.5 xhigh, 6/10 NEEDS-REWORK verdict). S9 (P0) system-prompt config redaction: before v0.4.14, render_config_section() dumped the merged settings.json verbatim into the system prompt sent to the LLM provider, leaking env, mcpServers..headers.Authorization Bearer tokens, hook command env, signed-URL query params, and apiKey fields. New renderer whitelists top-level fields (model/permissionMode/theme/outputStyle/permissions/sandbox with recursive redaction inside), redacts sensitive keys (apikey/token/secret/password/authorization/headers/env/_KEY/_SECRET/_TOKEN), replaces MCP command with placeholder, reduces MCP `url` to strict origin (scheme allow-list http/https/ws/wss, ASCII host, digit-only port, IPv6 brackets), and drops hook command strings entirely. Regression test exercises 9 distinct leak surfaces; URL parser has its own targeted test for 7 smuggling attempts including port-position secret injection (codex round-3 catch). P9 (P1): DeepSeek aris --help now points at aris setup option 7 instead of an env-var path the resolver never honored. M1/M2 (P1) doc: aris doctor + README/README_CN gain an experimental warning whenever mcpServers.len() > 0 (full MCP tool dispatch lands v0.4.16). C11 (P2) stream idle timeout: both Anthropic MessageStream and the OpenAI SSE loop wrap response.chunk().await in tokio::time::timeout (env ARIS_STREAM_IDLE_TIMEOUT_SECS, default 120, clamp [10, 1800], 0/negative disables); closes the "aris hangs forever with no output" symptom when an upstream HTTPS proxy holds a connection without keepalives. Bundle: 77 skills (+1 /wiki-enrich via late same-day sync to main 7e3ab67 which also picks up upstream check_ready.sh awk + grep-c null-match fix), 54 helpers. Codex MCP 6 rounds (NO-GO + 4 → GO-WITH-NITS + 3 → NO-GO + port smuggling → GO → release metadata GO → sync GO).
>
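
The idle-timeout rule from v0.4.14 (default 120, clamp to [10, 1800], 0 or negative disables) fits in one function. This Python sketch mirrors the stated rule only; the function name is hypothetical, and falling back to the default on an unparsable value is an assumption the release note does not state.

```python
import os
from typing import Optional

DEFAULT_SECS = 120
MIN_SECS, MAX_SECS = 10, 1800


def stream_idle_timeout(raw: Optional[str]) -> Optional[int]:
    """Resolve the idle timeout in seconds; None means the timeout is disabled."""
    if raw is None:
        return DEFAULT_SECS
    try:
        value = int(raw.strip())
    except ValueError:
        return DEFAULT_SECS  # assumption: unparsable input falls back to the default
    if value <= 0:
        return None          # 0 or negative disables the timeout
    return min(max(value, MIN_SECS), MAX_SECS)


timeout = stream_idle_timeout(os.environ.get("ARIS_STREAM_IDLE_TIMEOUT_SECS"))
```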
> v0.4.13 (2026-05-25): Residue-cleanup release closing every codex audit P1 carried since v0.4.10–v0.4.12, plus the long-tail regression tests. v0.4.10 P1.D per-server MCP timeout: mcpServers..requestTimeoutSecs override > MCP_REQUEST_TIMEOUT_SECS env > 300s default (clamped 1..=1800), so one Codex MCP agent can run 5 min while filesystem MCP errors in 5 s. v0.4.10 known limitation closed: McpStdioProcess::request() skips JSON-RPC notifications (id absent/null) and keeps reading until the correlated response. meta_opt hook deploy via aris init: tools/meta_opt/{log_event,check_ready}.sh bundle into the binary; aris init writes ARIS-namespaced aris-meta-opt-log-event.sh / aris-meta-opt-check-ready.sh to ~/.claude/hooks/ (codex round-1 #1: never clobbers user hooks); settings.json merge idempotent, backups hard-fail, final rewrite atomic via tempfile + rename. 9 v0.4.12 targeted regression tests for sandbox.strictMode (3) + parse strictMode + provider_match pricing + has_word o-series + stream_options 400 + meaningful-content classification + premature-EOF retry truth table (codex round-1 #3: should_retry_on_premature_eof() extracted to pure fn, 7-row test). Bundle: 76 skills, 54 helpers (+2 meta_opt scripts vs v0.4.12). Codex 3 rounds (NO-GO + 3 → NO-GO + metadata → GO).
>
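
The timeout precedence in v0.4.13 (per-server override, then environment variable, then the 300 s default, clamped to 1..=1800) can be sketched as follows. The function names and the dict-shaped inputs are hypothetical; skipping unparsable values is an assumption of this sketch.

```python
from typing import Mapping, Optional

DEFAULT_TIMEOUT_SECS = 300
MIN_SECS, MAX_SECS = 1, 1800


def _parse_secs(raw: object) -> Optional[int]:
    """Parse a timeout value; None when absent or not an integer."""
    try:
        return int(raw)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return None


def mcp_request_timeout(server_cfg: Mapping, env: Mapping) -> int:
    """Per-server requestTimeoutSecs > MCP_REQUEST_TIMEOUT_SECS env > default."""
    candidates = (
        server_cfg.get("requestTimeoutSecs"),
        env.get("MCP_REQUEST_TIMEOUT_SECS"),
    )
    for raw in candidates:
        value = _parse_secs(raw)
        if value is not None:
            return min(max(value, MIN_SECS), MAX_SECS)
    return DEFAULT_TIMEOUT_SECS


env = {"MCP_REQUEST_TIMEOUT_SECS": "60"}
codex_timeout = mcp_request_timeout({"requestTimeoutSecs": 300}, env)
filesystem_timeout = mcp_request_timeout({"requestTimeoutSecs": 5}, env)
```

This is how one long-running reviewer server and one fast filesystem server can coexist under a single global setting.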
> v0.4.12 (2026-05-22): Bug-fix + small-feature release. #238 sandbox.strictMode opt-in config key; when set, SandboxConfig::resolve_request() ignores all five LLM-supplied overrides (dangerouslyDisableSandbox, namespaceRestrictions, isolateNetwork, filesystemMode, allowedMounts), closing the gap where a tool call could silently bypass user sandbox policy. aris doctor adds a "Sandbox:" row; bash tool schema documents the strictMode semantics. #232 auto-review-loop-llm updated from legacy deepseek-chat / deepseek-reasoner (deprecation 2026-07-24; reasoner rejects tool_choice) to deepseek-v4-flash / deepseek-v4-pro. v0.4.10 audit P1 follow-ups: P1.A Anthropic stream retry gates on has_emitted_meaningful_content (a stream that only sent MessageStart before EOF is retry-eligible); P1.B supports_reasoning_effort + reviewer mirror use word-boundary match so openai/o3-mini / proxy:o4 route correctly; P1.C stream_options.include_usage:true proxy fallback retries once without on real 400 unknown-field errors; P2 pricing match precision via provider_match() so qwen3.6-plus / kimi-k2.5 route correctly while my-kimi-clone does not. Skills sync (76 skills, 52 helpers): /interview-cheatsheet + /render-html newly bundled; build.rs ALLOWED_EXTS gains html for render-html templates; EXCLUDED_SKILL_PREFIXES → starts_with("skills-codex"). CI fetch-depth: 0 + origin/main fetch so drift-test ancestor check runs. Cross-reviewed by Codex MCP (gpt-5.5 xhigh) over 4 rounds.
>
> v0.4.11 (2026-05-18): Skills bundle refresh + sync infrastructure. The embedded skills set in the v0.4.10 binary had fallen behind main (~6 of 56 main skills/ commits had been cherry-picked); v0.4.11 syncs the full set and ships sync infrastructure so the gap can't silently reopen. Bundle: 65→74 user-facing skills, 34→49 helper resources. 10 new skills bundled: /citation-audit (fourth-layer bibliography audit), /experiment-queue (SSH multi-seed job queue with OOM retry), /kill-argument (two-thread adversarial review for theory papers), /resubmit-pipeline (W5: text-only port to a new venue), /paper-talk (end-to-end conference talk pipeline), /slides-polish (per-page Codex layout review), /overleaf-sync (two-way Overleaf Git-bridge), /gemini-search + /openalex (broader literature sources), /qzcli (Qizhi GPU jobs). 46 existing SKILL.md refreshed, most critically the canonical resolver chain rollout (closes real user incident where /research-wiki was empty for a week from hardcoded tools/research_wiki.py), submission assurance gate + external verifier (/paper-writing Phase 6 now functions). tools/ goes 9→18: 9 baseline helpers refreshed (research_wiki.py 315→767 lines with canonical ingest_paper API), 9 new helpers (extract_paper_style.py, figure_renderer.py, paper_illustration_image2.py, overleaf_{setup,audit}.sh, verify_wiki_coverage.sh, watchdog.py, experiment_queue/{build_manifest,queue_manager}.py). New tools/sync_main_skills.sh automates main → bundle rsync with symlink pre-flight + codex-mirror prune + SKILLS_SOURCE_COMMIT pinning. 3 new CI drift tests in crates/runtime/src/cache.rs cover all 4 resolver layer patterns. Gemini MCP calls in /research-lit and /gemini-search now pass model: 'auto-gemini-3' (avoids silent downgrade to 2.5-pro on OAuth-personal capacity exhaustion). CLI runtime unchanged; codex-audit P1 follow-ups remain on v0.4.12 backlog. Cross-reviewed by Codex MCP (gpt-5.5 xhigh) across 5 rounds (REQUEST CHANGES → APPROVE WITH NITS → NO-GO → GO → final GO).
>
> v0.4.10 (2026-05-17): Stream + MCP reliability + multi-provider pricing. C6 whole-stream restart in Anthropic MessageStream + OpenAI SSE loop on chunk decode failure / premature EOF (ARIS_STREAM_RETRY, default 2, clamp 0..=5, fires only when nothing emitted yet; closes #228-style "error decoding response body" loop). M3 MCP stdio gains 300s default tokio::time::timeout over send+read (override MCP_REQUEST_TIMEOUT_SECS, clamp 1..=1800); response.id ↔ request.id correlation; ensure_server_ready() try_wait() dead-process respawn; kill().await on all failure paths so the next call starts clean (closes #151 / #172 "Calling codex..." stalls). C8/P4 OpenAI streaming requests now send stream_options.include_usage:true + parse cached_tokens; Anthropic streaming merges MessageStart.usage (input/cache) with MessageDelta.usage (output). C9 multi-provider pricing registry (15+ models, OpenAI cache_read = input × 0.1 corrects 5× generic overstatement, DeepSeek cache_hit/cache_miss tiers, has_word() boundary matcher for provider/ slugs). 9 dead-code warnings cleared; aris setup help text synced with actual behaviour.
>
> v0.4.9 (2026-05-17): Closes Codex v0.4.7 audit residuals (L1 TLS double-stack, L3 reasoning_cache compaction misalign, L4 reasoning replay unbounded). 2 new skills bundled (/figure-spec + /paper-illustration-image2 with scripts/ subdirs, new Layer 0b = $ARIS_CACHE_DIR/skills//scripts/); research_wiki.py promoted to shared tools/ (9+ callers); 5 more SKILL.md migrated to fallback chain.
>
> v0.4.8 (2026-05-17): Skill helper subsystem rewrite. Bundled helpers extract to ~/.config/aris/cache// at startup; every Skill invocation surfaces helperReport JSON + 4-layer resolver preamble; /skills export copies helpers; new integration-contract.md with 6 failure policies; 8 shared helpers (arxiv/deepxiv/exa/S2/openalex/save_trace/verify_papers/verify_paper_audits) bundled; /research-lit + /deepxiv migrated. Plus 4 bug fixes: gpt-5.5+tools 400 on OpenAI; Custom reviewer reset; missing signature field (#228); --version Build date hardcoded.
>
> v0.4.7 (2026-05-16): DashScope Coding Plan 405 fixed (#159) via native-tls switch (#225); reasoning_content replay for all reasoning models (OpenAI o1/o3/o4 / DeepSeek-R1 etc.), not just Kimi (#226); 600+ lines dead code cleanup + rustyline dep removed + "Claw Code" → "ARIS-Code" rebrand.
>
> v0.4.6 (2026-05-14): Two long-standing silent bugs fixed: PermissionMode::Prompt silently allowed every tool (derived-Ord bug); system prompt hardcoded current_date = "2026-03-31", which made models reject post-cutoff data as future/prompt-injection. Plus Custom OpenAI-compatible provider (/setup option 11) with dynamic /models discovery (@Anduin9527, #221 + #222).
>
> v0.4.5 (2026-05-13): First-class reasoning-model support: thinking content blocks end-to-end (fixes #161) + reasoning_effort='xhigh' for GPT-5.5 / o1 / o3 / o4 / DeepSeek-thinking. DeepSeek V4 Pro + Xiaomi MiMo + Qwen 3.6 + Doubao in /setup (options 7-10). Object-style hooks parser. Default model bumped to Claude Opus 4.7 + GPT-5.5. REPL input hardening (multi-line wrap / Cmd+V paste / CJK boundary). GitHub Actions CI. Credits: @GO-player-hhy (#186), @Jxy-yxJ (#171), @GetIT-Sunday (#216 partial).
>
>
>
> Older versions
>
> v0.4.4 (2026-04-20): Setup UX + reviewer routing fixes (resolves #158, #162) | /setup no longer forces Bearer for Anthropic + custom URL | Provider-aware proxy URL hints | Stale state no longer leaks across provider switches | LlmReview smart fallback
>
> v0.4.3 (2026-04-17): Third-party Anthropic-compat proxy support (Bedrock etc.) | Skip beta flags that proxies reject | Propagate custom base URL for anthropic provider | Credit @screw-44
>
> v0.4.2 (2026-04-17): Auto-compaction corruption fix | Compaction summary preserved on OpenAI-compat executors | Shell-provided API keys no longer erased on launch
>
> v0.4.1 (2026-04-15): Plan mode (/plan) | Cooperative Ctrl+C interrupt | Auto-retry (429/5xx/network) | Research Wiki (persistent knowledge base) | Self-Evolution (/meta-optimize) | Local models (LM Studio/Ollama) | 62 skills synced
>
> v0.3.11 (2026-04-13): Reviewer Anthropic-compatible mode (Claude via proxy)
>
> v0.3.9 (2026-04-11): Proxy/custom base URL (CCSwitch) | Local models (LM Studio/Ollama) | Windows (experimental)
>
> v0.3.5 (2026-04-08): Research Wiki (persistent papers/ideas/experiments/claims + relationship graph) | Meta-Optimize self-evolution (analyze logs → propose SKILL.md patches)
>
> v0.3.0 (2026-04-03): Multi-file memory index | Rich task system (TodoWrite) | /plan | Security hardening
>
> v0.2.2 (2026-04-03): /plan step-by-step planning | /tasks persistent tracking
>
> v0.2.1 (2026-04-03): Persistent Memory | Kimi K2.5 multi-turn fix | CJK cursor fix
>
> v0.2.0 (2026-04-02): Open source | Kimi + MiniMax + GLM support | Smart LlmReview routing | CI/CD
>
> v0.1.0 (2026-04-02): Initial release | Multi-executor & reviewer | 42 bundled skills
>
>
> Let Claude Code do research while you sleep. Wake up to find your paper scored, weaknesses identified, experiments run, and narrative rewritten, all autonomously.
>
> Radically lightweight: no infrastructure, zero lock-in. The entire skill layer is plain Markdown files. No framework to learn, no database to maintain, no Docker to configure, no daemon to babysit. Every skill is a single SKILL.md readable by any LLM. Swap Claude Code for Codex CLI, OpenClaw, Cursor, Trae, Antigravity, Copilot CLI, Windsurf, or your own agent and the workflows still work. Fork it, rewrite it, adapt it to your stack.
Custom Claude Code skills for autonomous ML research workflows. These skills orchestrate cross-model collaboration: Claude Code drives the research while an external LLM (via Codex MCP) acts as a critical reviewer. Also supports alternative model combinations (Kimi, LongCat, DeepSeek, etc.), so no Claude or OpenAI API is required. For example, MiniMax-M3 + GLM-5 or GLM-5 + MiniMax-M3. Codex CLI native: full skill set also available for OpenAI Codex. Cursor: works in Cursor too. Trae: ByteDance AI IDE. Antigravity: Google's agent-first IDE. Copilot CLI: GitHub's terminal agent (native SKILL.md + MCP). Free tier via ModelScope: zero cost, zero lock-in.
> Why not self-play with a single model? Using Claude Code subagents or agent teams for both execution and review is technically possible, but tends to fall into local minima: the same model reviewing its own patterns creates blind spots.
>
> Think of it like adversarial vs. stochastic bandits: a single model self-reviewing is the stochastic case (predictable reward noise), while cross-model review is adversarial (the reviewer actively probes weaknesses the executor didn't anticipate), and adversarial bandits are fundamentally harder to game.
>
> Why two models, not more? Two is the minimum needed to break self-play blind spots, and 2-player games converge to Nash equilibrium far more efficiently than n-player ones. Adding more reviewers increases API cost and coordination overhead with diminishing returns; the biggest gain is going from 1→2, not 2→4.
>
> Claude Code's strength is fast, fluid execution; Codex (GPT-5.5 xhigh) is slower but more deliberate and rigorous in critique. These complementary styles (speed × rigor) produce better outcomes than either model talking to itself.
>
> Want the strongest possible reviewer? Add — reviewer: oracle-pro to any skill to route reviews through GPT-5.5 Pro via Oracle MCP. Pro-level reasoning for proof verification, experiment auditing, and final stress tests. Works with API key or free browser mode. Setup →
> These are full pipelines, but you can also use each workflow independently. Already have an idea? Skip to Workflow 1.5. Have results? Jump to Workflow 3. Got reviews? Jump to Workflow 4. Want persistent memory? Enable Research Wiki. See Quick Start for all commands and Workflows for the full breakdown.
Basic mode: give ARIS a research direction and it handles everything:
/research-pipeline "factorized gap in discrete diffusion LMs"
Targeted mode: got a paper you want to improve? Give ARIS the paper + the code:
ARIS reads the paper → finds its weaknesses → clones the codebase → generates ideas that specifically fix those weaknesses with that code → runs experiments → writes your paper. Like telling a research assistant: "read this paper, use this repo, find what's missing, and fix it."
> Mix and match: ref paper only = "what can be improved?", base repo only = "what can I build with this code?", both = "improve this paper using this code."
Rebuttal mode: reviews just dropped? Don't panic. ARIS reads every concern, builds a strategy, and drafts a rebuttal that's grounded, structured, and under the character limit:
/rebuttal "paper/ + reviews" — venue: ICML, character limit: 5000
Three safety gates; the rebuttal will NOT finalize if any of them fails:
No fabrication: every claim maps to a paper/review/user-confirmed result
No overpromise: every promise is user-approved
Full coverage: every reviewer concern is tracked
Two outputs: PASTE_READY.txt (exact char count, paste to venue) + REBUTTAL_DRAFT_rich.md (extended version for manual editing).
Rebuttal parameters: venue, character limit (required), quick mode, auto experiment, stress test rounds, followup rounds
> From idea to paper to podium: one toolchain.
2. What's New
2026-06-20 - Research wiki: all four node layers now have deterministic writers, fixing "re-generated ideas not recorded" (#305, #306, #307, #308). A user hit a real bug: ideas recorded on the first /idea-creator run vanished on re-generation, because wiki pages were written freehand, a prose step the model skips on a re-prompt. Each layer now has a dedicated research_wiki.py writer joining ingest_paper: add_claim (claims born at /proof-checker), upsert_idea (/idea-creator), add_experiment (/result-to-claim), each guarded by a drift-check so it can't silently regress to dead code. A claim's status is now a strict proof axis (verified/refuted/unproven/…) while experiment support is carried by supports/invalidates edges (closing a latent contradiction the shared validator rejected), and the Codex-CLI skill mirror is synced to match. Zero behavior change when no research-wiki/ is present.
2026-06-19 - Overnight-loop resilience: silent-death watchdog + stall→structural-pivot (#300, #301, #302; operational patterns absorbed from Deli Chen's AutoResearch framework). Two failure modes an unattended /loop / CronCreate heartbeat couldn't catch. (1) Silent death: the heartbeat is parasitic on a living session, so context compaction or a closed session kills it and nothing notices. A new watchdog loop task type (watchdog.py) judges liveness by the state file's mtime against the loop's own stale_after_seconds, surfacing STALE / MISSING / COMPLETED to alerts.log. It is detect-only and never restarts a verdict-bearing loop. (2) Cognitive spin: a stalled loop retries near-variants forever. A new iteration_log.py counts NEW findings per tick: stale_count ≥ 2 forces a structural pivot (change the frame + pick an untried direction), ≥ 4 escalates to a human. Both are Type-A signals ("keep going / change direction," never "good enough"); quality still terminates in the cross-model jury.
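The two checks in this entry can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in watchdog.py or iteration_log.py: the function names, the `ALIVE` label, the `completed` flag, and the reading of `stale_count` as "consecutive trailing ticks with zero new findings" are all hypothetical.

```python
from __future__ import annotations

import os
import time


def liveness(state_path: str, stale_after_seconds: float,
             completed: bool = False, now: float | None = None) -> str:
    """Classify a loop from its state file's mtime (detect-only, never restarts)."""
    if completed:  # assumption: the caller knows the loop finished
        return "COMPLETED"
    if not os.path.exists(state_path):
        return "MISSING"
    now = time.time() if now is None else now
    age = now - os.path.getmtime(state_path)
    # "ALIVE" is a hypothetical label for the healthy case.
    return "STALE" if age > stale_after_seconds else "ALIVE"


def stall_action(new_findings_per_tick: list[int]) -> str:
    """Map trailing ticks without new findings to a Type-A signal."""
    stale_count = 0
    for found in reversed(new_findings_per_tick):
        if found > 0:
            break
        stale_count += 1
    if stale_count >= 4:
        return "escalate_to_human"
    if stale_count >= 2:
        return "structural_pivot"
    return "continue"
```

Neither function judges quality; both only answer "keep going or change direction", which matches the entry's point that the verdict stays with the cross-model jury.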
2026-06-07 - /paper-poster-html: new DEFAULT poster pipeline (skill #79); LaTeX /paper-poster retired to a redirect stub. Builds the poster as a single HTML/CSS file on the venue's exact print canvas and iterates by measuring, not eyeballing: hard gates (column-balance spread < 5 px, two-hue design-token discipline, real-paper-figure provenance manifest, figure-area bands) must PASS before any reviewer sees the poster; a closed fix vocabulary (token / component / rebalance / asset / canvas) structurally kills the cosmetic patch-loop; a fresh cross-model review acquits content fidelity (claim→evidence audit + final print-readiness pass). Ships 3 templates + a catalogued component library (incl. density components: equation anatomy, flow-strip, duo figures, derived-Δ tables, claim pills) and 6 venue token packs. Core gate machinery adapted from posterly (MIT, by @Chenruishuo); ARIS adds the style/asset gates, the density system, and the cross-model loop. ⚠️ /paper-poster now redirects to /paper-poster-html; the legacy LaTeX pipeline remains only in git history.
2026-05-31 - Community spotlight: two tools worth a look. Claude Fleet (@tianyilt) is a local read-only dashboard to triage / Focus / full-text-search across many concurrent Claude Code + Codex windows. posterly (@Chenruishuo) is a Claude Code skill that builds academic conference posters as a single HTML/CSS file → print-ready PDF via headless Chromium (no LaTeX). Both indexed under Awesome Community. Star them if they help you.
2026-05-31 - Fourth reviewer backend: Gemini via Antigravity CLI (#267 by @ZGJY95). — reviewer: agy routes review through the Antigravity CLI for users without Codex MCP / Oracle. It is fail-closed on the cross-model invariant (recovers + verifies the real Gemini-family model, refuses non-Gemini, binds the recovered transcript to the call via a user-event nonce). Wired into reviewer-routing.md.
2026-05-29 - ultracode-native convention layer: fan out for breadth on any runtime tier, keep the cross-model jury sacred. Three new shared-references docs decouple breadth from verdict: fan-out-pattern.md (skills generate candidates across same-family Claude subagents, via Tier-1 Workflow / Tier-2 Agent / Tier-3 sequential, all ending in the identical cross-model jury), acceptance-gate.md ("a loop can DRIVE, it cannot ACQUIT": self-judge execution-completeness, never quality/correctness), and external-cadence.md (/loop & CronCreate are fire-control, never a jury). Wired into /idea-creator, /research-lit, /proof-checker, /kill-argument (fan-out) plus 16 skills (cadence fence/affordance). Also stripped 48 vestigial Agent grants (least-privilege + a drift-check guard), fixed /idea-creator's same-family idea pre-filter, and reconciled an /auto-review-loop OR→AND stop-condition inconsistency. Non-ultracode users benefit immediately: fan-out degrades to sequential with the same final jury.
2026-05-28 - First blog shipped: A Survey on Continuous DLM (2026 H1, 6 papers), a long-form bilingual technical survey by Ruofeng Yang (SJTU), written end-to-end through the ARIS-in-AI-Offer workflow (Claude Opus 4.7 + Codex GPT-5.5 xhigh + Gemini auto-gemini-3 cross-model discussion). Compares ELF, ByteDance Cola-DLM, and Flow-Matching family across discrete-DLM problems, the "known-unknown" continuous space idea, training pipeline, architecture / params / shapes, inference grids + Tab 6/7 numerical results, denoising trajectories, and a Field Landscape against Cola-DLM. A 1.7 MB self-contained HTML (no build) that demonstrates the kind of long-form analysis the /render-html toolchain can produce.
2026-05-26 - HTML auto-emission activated at 8 workflow checkpoints. /idea-discovery, /auto-review-loop, /research-pipeline, /kill-argument, /proof-checker, /paper-claim-audit, /citation-audit, /rebuttal now auto-render their primary MD artifact to a single-file HTML view via /render-html. Cost-tiered: interim views use --no-review, audit-class / reviewer-facing deliverables keep the full Codex render-fidelity gate. Default on (RENDER_HTML = true); per-skill opt-out. Failures are non-blocking; the source MD stays canonical.
2026-05-26 - Community PR wave: 5 merges this week. /wiki-enrich (#247 by @hungchun0201) fills the paper TODOs that ingest_paper leaves as scaffolds (Karpathy LLM-wiki principle, fetch chain alphaxiv→deepxiv→arXiv). Mirror drift checker + CI (#241 by @VeraPyuyi) keeps main and mirror in sync. /research-pipeline Stage 2/3 unified into /experiment-bridge delegation (#243 by @ZBigFish); the old inline version was a strict subset of the bridge. Windows PowerShell installer parity with reparse-chain inside-repo guard + -FromOld legacy migration + Windows CI matrix (#242 by @VeraPyuyi). Plus manual-review MCP (#246 by @ZBigFish), the third reviewer backend: — reviewer: manual for zero-cost cross-model review (paste prompt to any non-Claude model: DeepSeek / Kimi / ChatGPT / Gemini / local llama); cross-model invariant guarded by bilingual UI banner + per-session token auth + fail-closed when MCP unavailable.
2026-05-17 - Tools-stability roadmap (Phase 1+2+3) complete (closes #176 / #177 / #178). Community reported that helper scripts weren't propagating into user projects after install_aris.sh. Phase 1: every SKILL.md caller of the 10 canonical helpers now resolves via the strict-safe 3-layer chain .aris/tools/ → tools/ → $ARIS_REPO/tools/ documented in integration-contract.md §2 (which also defines 5 failure policies A/B/C/D1/D2/E). Phase 2: new advisory CI lint catches hardcoded python3 tools/foo.py patterns in PR-modified SKILL.md (advisory only, never fails CI). Phase 3: three single-owner helpers (figure-spec, paper-illustration-image2, experiment-queue) moved into their SKILL's scripts/ subdirectory; owner SKILLs use Layer 0 ${CLAUDE_SKILL_DIR}/scripts/ ahead of the canonical chain; legacy tools/ paths retained as os.execv Python forwarding shims. ⚠️ Existing users: no action needed, since legacy tools/ entries are now shims. If you haven't run install_aris.sh since 2026-04-30, one idempotent rerun catches everything up.
2026-05-14 - **/paper-plan + /paper-write learn GAP_REPORT.md + discipline** ([#217](https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep/issues/217)). When `— style-ref:` is set and the user's project has structural assets (`figures/`, `results/`, `NARRATIVE_REPORT.md`, etc.), `/paper-plan` emits a **Gap Report** mapping the exemplar's section topology + density (from `style_profile.md`) against your actual assets, surfacing slots you have **no evidence to fill** (e.g., "exemplar has 3×4 ablation table, you have no ablation data"). Then `/paper-write` writes HTML comments instead of fabricating content at missing slots: invisible in the compiled PDF, grep-friendly for human triage / /experiment-bridge follow-up. Narrow carve-out from the "no placeholders" rule, scoped to GAP_REPORT-listed slots only. Original idea by @zhangpelf.
2026-05-14 - Default reviewer model: gpt-5.4 → gpt-5.5 across ~30 SKILL.md REVIEWER_MODEL defaults. Codex MCP has routed gpt-5.5 as the default since 2026-04-24; this catches the docs up to runtime. ⚠️ Behavior changes: (a) .aris/traces/* JSONs from prior runs are not reproducible; re-runs on 5.5 may emit different WARN/FAIL verdicts on borderline cases (reviewer-quality lift, not regression). (b) ChatGPT Plus/Pro monthly quotas drain faster under heavy use. Fallback: pass — reviewer-model: gpt-5.4 per invocation, or pin REVIEWER_MODEL = gpt-5.4 per skill. Oracle Pro tier (routed via — reviewer: oracle-pro) is a separate path and unaffected.
2026-05-13 - tools/verify_papers.py + Pre-Search Verification Protocol: an anti-hallucination filter for literature-facing skills. New helper does 3-layer fallback verification (arXiv batch API up to 40 IDs/request → CrossRef DOI lookup → Semantic Scholar fuzzy title match, default 0.6 word-overlap) and emits 4-state per-paper status (verified / unverified / verify_pending / error) plus a top-level verdict aligning with assurance-contract.md (PASS / WARN / BLOCKED / ERROR). Transient failures (5xx, timeouts, 429) are tagged verify_pending and excluded from the hallucination rate so network blips don't get conflated with fabricated references. Per-project cache at /.aris/cache/verify_papers.json with 30-day TTL; canonical key priority arxiv:{id_without_version} → doi:{lowercase} → title:{sha1[:16]}. New Pre-Search Verification Protocol subsection in shared-references/citation-discipline.md makes the split explicit: this protocol is the fast filter between SEARCH (Step 1) and full VERIFY (Step 2); /citation-audit and /paper-claim-audit remain the submission-time audit gates and are not replaced. /research-lit gets a mandatory Step 1.5: Verify Candidate Papers calling the helper; /idea-creator and /novelty-check add a Key Rule reference for cited Closest Prior Work / landscape entries. Unverified papers are retained in output tagged [UNVERIFIED] (retention-over-silent-removal) so search-quality issues stay visible. Set ARIS_VERIFY_EMAIL in your shell to lift CrossRef to the polite-pool rate. Original signal from @YiwenZhu77 in #120; landed via clean reimplementation rather than direct merge (PR was 5 weeks old + scope creep into figure-style).
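The canonical cache-key priority and the fuzzy title threshold can be illustrated with a short sketch. This is not the code in tools/verify_papers.py: the function names are hypothetical, normalizing the title (strip + lowercase) before hashing is an assumption, and Jaccard overlap over word sets is only one plausible reading of "word-overlap".

```python
from __future__ import annotations

import hashlib
import re


def cache_key(arxiv_id: str | None = None, doi: str | None = None,
              title: str | None = None) -> str:
    """Pick the cache key by priority: arXiv id, then DOI, then title hash."""
    if arxiv_id:
        # Drop a trailing version suffix such as "v2".
        return "arxiv:" + re.sub(r"v\d+$", "", arxiv_id.strip())
    if doi:
        return "doi:" + doi.strip().lower()
    if title:
        # Assumption: the title is normalized before hashing.
        digest = hashlib.sha1(title.strip().lower().encode("utf-8")).hexdigest()
        return "title:" + digest[:16]
    raise ValueError("need an arXiv id, a DOI, or a title")


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (assumed metric)."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def title_matches(candidate: str, canonical: str, threshold: float = 0.6) -> bool:
    """Accept a fuzzy title match at or above the default 0.6 threshold."""
    return word_overlap(candidate, canonical) >= threshold
```

Keying on the version-less arXiv id means `2501.12345v1` and `2501.12345v2` share one cache entry, which is presumably why that key takes priority over the DOI and the title hash.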
2026-05-06 - /paper-talk workflow + /slides-polish skill: end-to-end conference talk pipeline. /paper-talk orchestrates paper → slide outline → Beamer + PPTX → per-page polish → assurance audits → final report (sister to /paper-writing, /paper-poster); composes /paper-slides, /slides-polish, plus /paper-claim-audit + /citation-audit when assurance: conference-ready. /slides-polish is the post-generation visual pass: per-page Codex review against a reference PDF + a fix-pattern catalog (PPTX font scaling 1.5-1.8× for projector-readable size, text-frame resize after font bump, banner-as-tcolorbox, italic style leak guard, em-dash spacing, Chinese EA font hint via PingFang SC, anonymity placeholder discipline). Assurance ladder draft / polished (default) / conference-ready is independent from the effort axis; effort: lite, assurance: conference-ready is legal and means "fast pipeline, every audit must emit verdict before final". Phase 4 staging adapter materializes slide text + speaker notes + talk script as a synthetic paper directory (.aris/paper-talk/audit-input/sections/*.tex + symlinked .bib / results/ / figures/) so the existing audits run with their paper-shaped contracts and emit 6-state JSON verdicts per shared-references/assurance-contract.md.
2026-05-05 - /resubmit-pipeline, Workflow 5: text-only resubmit across venues (#208). Port a polished paper from one venue to another under hard constraints (no new experiments, no bib edits, no framework changes, never overwrite prior submissions). 5 phases: physical isolation → 5-layer anonymity check → audits (proof / claim / citation --soft-only) → microedits via /auto-paper-improvement-loop --edit-whitelist with per-round diff gate → adversarial gate via /kill-argument → final compile + Overleaf push via /overleaf-sync. Two prerequisite SKILL upgrades shipped in the same PR: /auto-paper-improvement-loop --edit-whitelist (YAML schema with allowed/forbidden paths + forbidden_operations like new_cite / new_theorem_env / numerical_claim, forbidden_deletions, requires_user_approval_for, max_edits_per_round) and /citation-audit --soft-only (translates KEEP/FIX/REPLACE/REMOVE verdicts to text-rewrite proposals when bib is frozen; hallucinated citations get drop_cite_in_body_only action). Master RESUBMIT_REPORT.json ledger per shared-references/assurance-contract.md; 7-verdict failure mode table including USER_DECISION runtime state.
2026-05-05 - /kill-argument: adversarial Attack-Adjudication review for theory papers (#206). Two fresh codex 5.5 + xhigh threads: Thread 1 writes the strongest 200-word rejection memo a senior area chair would produce; Thread 2 (independent adjudicator, NOT defender) reads the current paper and classifies each rejection point as answered_by_current_text / partially_answered / still_unresolved with file:line evidence. Output: KILL_ARGUMENT.{md,json}, detect-only. Integrated as Phase 5.6 of /paper-writing (between claim-audit and citation-audit) and as the canonical implementation called from /auto-paper-improvement-loop Step 5.5, replacing the inline prompt in both places. Mandatory at assurance: submission for theory-heavy / scope-heavy papers; emits NOT_APPLICABLE for empirical papers without scope claims. Audit JSON is verify_paper_audits.sh-compatible (full schema per shared-references/assurance-contract.md, 6-state verdict). Catches the failure mode score-based reviews miss: when every local component is correct (numbers match, cites resolve, theorems prove) but the paper still oversells what it actually establishes.
2026-05-04 - /research-wiki and 8 caller skills now resolve helper via fallback chain (#204). Bug: after bash tools/install_aris.sh the helper lives at .aris/tools/research_wiki.py (symlink), but skills hard-coded tools/research_wiki.py and silently failed when invoked, so research-wiki/ stayed empty across full W1 runs. Fix: 3-layer chain (.aris/tools/ → tools/ → $ARIS_REPO/tools/) codified in shared-references/wiki-helper-resolution.md. The manual-copy workaround at /tools/research_wiki.py is layer 2, so users who cp-installed the helper as a temporary fix continue to work. ⚠️ Existing users: rerun bash tools/install_aris.sh once; this also picks up a separate Python 3.9 ImportError fix in the helper.
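A rough sketch of that 3-layer lookup, assuming the first existing file wins. The function name `resolve_helper` and its parameters are hypothetical; the actual skills describe the chain in shared-references/wiki-helper-resolution.md rather than through this code.

```python
from __future__ import annotations

import os
from pathlib import Path


def resolve_helper(name: str, project_root: str = ".",
                   aris_repo: str | None = None) -> Path | None:
    """Return the first existing helper along the 3-layer chain, else None."""
    root = Path(project_root)
    if aris_repo is None:
        aris_repo = os.environ.get("ARIS_REPO")
    candidates = [
        root / ".aris" / "tools" / name,  # layer 1: symlink created by install_aris.sh
        root / "tools" / name,            # layer 2: manual copy inside the project
    ]
    if aris_repo:
        candidates.append(Path(aris_repo) / "tools" / name)  # layer 3: the ARIS checkout
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Returning `None` instead of falling back to a hard-coded path lets the caller fail loudly, which avoids the silent failure described in the bug report.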
2026-05-03 - Opt-in — style-ref: for writer-side skills (#202). /paper-{plan,write,writing,illustration,poster,slides}, /grant-proposal, and /auto-paper-improvement-loop accept an optional — style-ref: argument that mimics a reference paper's structural style (section ordering, theorem/figure density, sentence cadence, citation style) without copying its prose, claims, or terminology. Sources: local .tex dir/file, local PDF, arXiv id (2501.12345 or arxiv:2501.12345), HTTP/HTTPS URL. Overleaf URLs/IDs are rejected; clone via /overleaf-sync setup first. Default OFF; existing behavior unchanged when the flag is absent. Reviewer / auditor sub-skills (/proof-checker, /paper-claim-audit, /citation-audit, the improvement-loop reviewer) never see the style ref, so cross-model review independence is preserved. ⚠️ Existing ARIS users: the helper ships at tools/extract_paper_style.py, distributed via the .aris/tools symlink (install_aris.sh Phase 0, added in #192). Re-run bash tools/install_aris.sh once to refresh the symlink and pick up the helper. Manual fallback: cp /tools/extract_paper_style.py /tools/. Without either, the writer skill aborts with a clear error pointing here.
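The accepted source kinds can be illustrated with a small classifier. This is a hedged sketch, not the logic of tools/extract_paper_style.py: the function name, the returned labels, and the arXiv-id regex are assumptions, and it only detects Overleaf URLs (bare Overleaf project IDs, which the entry also rejects, are not handled here).

```python
import re
from urllib.parse import urlparse

# Assumed pattern for new-style arXiv ids, with an optional "arxiv:" prefix.
ARXIV_ID = re.compile(r"^(?:arxiv:)?\d{4}\.\d{4,5}(?:v\d+)?$", re.IGNORECASE)


def classify_style_ref(ref: str) -> str:
    """Classify a style reference; raise ValueError for Overleaf URLs."""
    ref = ref.strip()
    if ARXIV_ID.match(ref):
        return "arxiv"
    parsed = urlparse(ref)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        if host == "overleaf.com" or host.endswith(".overleaf.com"):
            raise ValueError(
                "Overleaf URLs are rejected; clone via /overleaf-sync setup first"
            )
        return "url"
    if ref.lower().endswith(".pdf"):
        return "local_pdf"
    return "local_tex"  # anything else is treated as a local .tex dir/file
```

For example, `classify_style_ref("arxiv:2501.12345")` yields `"arxiv"`, while a path such as `refs/exemplar.pdf` falls through to `"local_pdf"`.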
2026-05-02 - Community spotlight: rosetta by @SyntaxSmith. Programmatic access to ChatGPT Pro / gpt-5.5-pro / DeepResearch from Node, via Chrome CDP Fetch interception + WebSocket second-leg streaming; ships an MCP server for Claude Code / Codex / Cline. Alternative implementation path to Oracle MCP for ARIS users invoking — reviewer: oracle-pro: same target capability (Pro-tier reviewer), different mechanics. Indexed under Awesome Community Skills & Extensions. Star it if you're using it!
2026-05-02 - Model & MCP routing updates. (a) /gemini-search default bumped to gemini-3-pro-preview (strongest Gemini, out-of-box). ⚠️ Action required: requires gemini-cli v0.40+ (run gemini --version; upgrade with npm i -g @google/gemini-cli if older). Legacy override: /gemini-search "topic" — model: gemini-2.5-pro. Other overrides: gemini-3-flash-preview (faster), auto-gemini-3 (load-routed). (b) /idea-discovery Phase 1 now includes Gemini in its literature survey by default (#199): it auto-injects — sources: all, gemini into /research-lit unless the user passed an explicit — sources:; graceful skip if gemini-cli is not installed. (c) Oracle MCP upstream PR queue (steipete/oracle/pulls) is the first triage stop when invoking — reviewer: oracle-pro (especially o3-deep-research / gpt-5.5-pro). ARIS does not vendor Oracle MCP; check upstream first if behavior surprises you (reviewer-routing.md)
2026-05-02 - Tools-infrastructure migration started. (a) install_aris.sh creates optional .aris/tools symlink (#192, closes #174): Phase 0 of the 4-step tools-stability plan (#174 → #176 → #177 → #178); idempotent, zero impact until rerun. (b) /experiment-queue orchestration paths repaired (#193), the first real user of the symlink; 7 cascading bugs fixed via 3 rounds of Codex MCP gpt-5.5 xhigh audit. Pure prose + docstring; queue_manager.py logic untouched. Windows install_aris.ps1 parallel update tracked as follow-up
2026-05-02 - Three new opt-in audit flags via fast-path delegated-agent workflow (#187, #188, #189). /citation-audit --uncited surfaces bib entries with no \cite{} reference (detect-only). /proof-checker --deep-fix adds a repair-grade plan to the Phase 1 reviewer prompt (corrected statement / patch plan / closure tests + Schur/quadratic-form algebra sanity). /proof-checker --restatement-check adds Phase 3.6 cross-location theorem drift detection (6 drift signatures). Zero behavior change when flags unset. Plus doc PRs #190 (thread-policy) + #191 (auto-loop xref). Delegated-agent + maintainer-fixup pattern; Codex MCP gpt-5.5 xhigh review caught 6+ blockers
2026-05-01 - Gemini + OpenAlex literature sources for /research-lit (#175, community contribution by @stdAri). Two opt-in sources: /gemini-search (AI-driven discovery via jamubc/gemini-mcp-tool MCP) and /openalex (250M+ work open citation graph, no API key). Triggered via — sources: gemini or — sources: openalex; zero behavior change when default all (both excluded). Maintainer fixups: corrected @google/gemini-cli npm name; added try/except ImportError + bash preflight for graceful OpenAlex skip when requests missing
2026-04-30 - /rebuttal per-reviewer thread mode + transferable patterns (SKILL.md). Adds VENUE_MODE (single_document | per_reviewer_thread) for OpenReview-style venues, reviewer_priority: pivotal routing, structural_distinction response mode, 5 reviewer-defensive heuristics, 2 Phase 5 lints, and severity-scaled stress rounds. Default VENUE_MODE = single_document keeps ICML-style behavior, so nothing changes for existing users. Three rounds of cross-model review before/after merge
2026-04-30 - Codex skill mirror rebuilt + dedicated install/update chain (#179, community contribution by @No-518). skills/skills-codex/ now mirrors all 67 mainline skills; replaces mcp__codex__codex reviewer path with Codex-native spawn_agent + send_input. New tools/install_aris_codex.sh + tools/smart_update_codex.sh handle project-local symlinks with manifest tracking. Anti-drift: tests/test_codex_skill_mirror.py + tests/test_codex_install_update.py (26 failure paths). Open discussion in #184
2026-04-24 - /paper-illustration-image2: Codex-native image generation as Phase 2b illustration backend (#166, community contribution by @kbr19-thu 清华). Uses ChatGPT Plus/Pro quota via local Codex app-server MCP bridge, so no GEMINI_API_KEY is required. Triggered by /paper-writing — illustration: codex-image2; default stays figurespec (zero behavior change). Async-only API, sandboxed writes to figures/ai_generated/, integration-contract-compliant helper. Marked experimental (Codex debug app-server is unstable upstream)
2026-04-21 - Research Wiki ingest actually works now (research_wiki.py, /research-wiki). Fixes user-reported bug where /research-wiki init left papers/ empty forever (ingest subcommand had no implementation; paper-reading skills had no wiki hook). New canonical python3 tools/research_wiki.py ingest_paper helper owns slugging / metadata fetch / dedup / page render; all 6 paper-reading skills wired to it. Manual backfill via sync --arxiv-ids or sync --from-file. Ships with integration-contract.md formalizing the six-component pattern every cross-skill integration must follow
2026-04-21 - Assurance Gate: — effort: beast | max now really runs mandatory audits (assurance-contract.md, tools/verify_paper_audits.sh). Fixes silent-skip of /proof-checker / /paper-claim-audit / /citation-audit at high effort. New assurance axis (draft | submission) independent from effort: lite / balanced → draft (zero behavior change), max / beast → submission. At submission the 3 audits emit a JSON artifact with 6-state verdict; paper-writing Phase 6 runs the external verifier as source of truth (non-zero exit blocks Final Report). SHA256 input hashing catches stale audits. Escape hatch: — effort: beast, assurance: draft
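The effort-to-assurance defaults and the stale-audit check can be sketched as below. This is a minimal illustration under assumptions, not the contents of tools/verify_paper_audits.sh or assurance-contract.md: the function names and the exact hashing scheme (sorted file names and contents fed into one SHA-256) are hypothetical.

```python
from __future__ import annotations

import hashlib

# Default mapping stated in the entry: lite/balanced -> draft, max/beast -> submission.
DEFAULT_ASSURANCE = {
    "lite": "draft",
    "balanced": "draft",
    "max": "submission",
    "beast": "submission",
}


def resolve_assurance(effort: str, assurance: str | None = None) -> str:
    """An explicit assurance value wins; otherwise derive it from effort."""
    if assurance is not None:
        if assurance not in ("draft", "submission"):
            raise ValueError(f"unknown assurance level: {assurance}")
        return assurance
    return DEFAULT_ASSURANCE[effort]


def input_hash(files: dict[str, bytes]) -> str:
    """Hash audit inputs deterministically (assumed scheme: sorted names + contents)."""
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(files[name])
        h.update(b"\0")
    return h.hexdigest()


def audit_is_stale(recorded_hash: str, files: dict[str, bytes]) -> bool:
    """An audit is stale when its recorded hash no longer matches the inputs."""
    return recorded_hash != input_hash(files)
```

With these defaults, `resolve_assurance("beast")` gives `"submission"`, while `resolve_assurance("beast", "draft")` reproduces the escape hatch.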
2026-04-20 - Project install: flat layout + manifest tracking. Fixes a real bug where the previous nested install (.claude/skills/aris/) hid skills from Claude Code's slash-command discovery (CC only scans one directory level). Anyone who ran install_aris.sh before this date was silently affected. New install_aris.sh creates one symlink per skill at .claude/skills/, writes a versioned manifest to .aris/installed-skills.txt, and is re-runnable to reconcile new/removed upstream skills. Defense-in-depth: 13 safety rules (no-symlinked-parents, exact-target revalidation, slug regex, atomic same-dir manifest rename, no-overwrite-real-files, mkdir-based portable lock, ADOPT for crash recovery, …). Granular --adopt-existing / --replace-link flags replace the all-or-nothing --force. Migration paths: --from-old for legacy nested symlink, --migrate-copy keep-user|prefer-upstream for legacy nested copy. smart_update.sh --target-subdir .claude/skills/aris is now deprecated with a redirect to install_aris.sh. Stale-file bug in cp -r overlay also fixed (now rm -rf && cp -r for safe-update path)
2026-04-19 - /overleaf-sync: two-way bridge between local ARIS paper directory and an Overleaf project via the official Overleaf Git bridge (Premium). Lets collaborators keep editing in the Overleaf web UI while ARIS audit/edit pipelines (/paper-claim-audit, /citation-audit, /auto-paper-improvement-loop) keep running locally. Sub-commands: setup (one-time, user-driven so the agent never sees the token) / pull (with diff-protocol: flags half-sentences, typos, claim/cite changes that should re-trigger audits) / push (with confirmation gate before writing to shared Overleaf state) / status (3-way divergence check). Token never touches the agent or any file: it is primed once into macOS Keychain via the user's terminal, then auth-free for all subsequent agent operations
2026-04-19 - /citation-audit: fourth and final layer of the evidence-and-claim assurance stack (experiment-audit → result-to-claim → paper-claim-audit → citation-audit). Fresh cross-family reviewer (gpt-5.4 via Codex MCP) with web/DBLP/arXiv lookup verifies every \cite{...} along three independent axes: existence (paper resolves at claimed arXiv ID/DOI/venue), metadata correctness (authors/year/venue/title match canonical sources), and context appropriateness (the cited paper actually establishes the claim it supports, the most diagnostic check). Per-entry verdicts: KEEP / FIX / REPLACE / REMOVE. Auto-integrated into Workflow 3 Phase 5.8 as the pre-submission bibliography gate. Empirical motivation: in a real submission run, several real papers were cited in contexts they did not actually support, and at least one entry shipped with author = "Anonymous"; none were caught by metadata-only checks
2026-04-17 - /experiment-queue integrated into Workflow 1.5 + research-pipeline. experiment-bridge Phase 4 Deploy now auto-routes by milestone job count: ≤5 jobs → /run-experiment, ≥10 jobs or phase dependencies → /experiment-queue (with OOM retry, stale-screen cleanup, wave-transition gating, crash-safe state). New --- batch: queue override for global force-queue mode. Large multi-seed sweeps from EXPERIMENT_PLAN.md (e.g., 36-cell N × seed × n_train grids) now get proper orchestration without manual queue invocation
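The routing rule reads roughly like the sketch below. Two points are assumptions, not statements from the entry: that phase dependencies force the queue even for small milestones, and what happens for 6 to 9 jobs, which the entry does not specify (the hypothetical `midrange_default` parameter makes that gap explicit). The function name is also hypothetical.

```python
def choose_runner(job_count: int,
                  has_phase_dependencies: bool = False,
                  force_queue: bool = False,
                  midrange_default: str = "/run-experiment") -> str:
    """Pick a deploy skill from the milestone's job count."""
    # Force-queue override, phase dependencies, or a large sweep -> queue.
    if force_queue or has_phase_dependencies or job_count >= 10:
        return "/experiment-queue"
    if job_count <= 5:
        return "/run-experiment"
    # 6-9 jobs: not specified in the entry, so the caller chooses (assumption).
    return midrange_default
```

Under these assumptions a 36-cell sweep routes to `/experiment-queue`, and so does a 3-job milestone whose phases depend on each other.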
2026-04-17 - Project-local symlink install (resolves #118): new recommended default install. bash tools/install_aris.sh auto-detects platform (Claude Code / Codex CLI), creates .claude/skills/aris or .agents/skills/aris symlink to the ARIS repo, adds a managed block to CLAUDE.md / AGENTS.md telling the agent to use only project-local skills, and records install metadata in .aris/skill-source.txt. Solves the skill collision problem when ARIS is mixed with Superpowers / OpenHands / other community packs in the same global skill directory. PowerShell version (install_aris.ps1) ships with junction support for Windows. smart_update.sh --target-subdir flag added for .agents/skills/aris (Codex) project-copy installs; symlinked installs now correctly refuse smart_update and direct users to git pull. Global install remains supported for power users
2026-04-16 - /figure-spec: deterministic JSON→SVG renderer packaged as a first-class skill. Preferred default for architecture/workflow/pipeline/audit-cascade figures in papers. Shape-aware edge clipping (rect/circle/ellipse/diamond), self-loops, curved edges, multi-line labels with CJK width estimation. Editable vector output, reproducible (same spec → same SVG), no external API. Phase 2b in Workflow 3 restored: illustration: figurespec (new default) / gemini / mermaid / false, a 4-way illustration selector with complementary strengths
2026-04-16 - /experiment-queue: SSH job queue for multi-seed/multi-config ML experiments. Designed from real 36-cell NeurIPS sweep pain points: OOM-aware retry with backoff, stale-screen cleanup, wave-transition race prevention, teacher→student phase dependencies, crash-safe scheduler that resumes from JSON state. Declarative grid specs expand automatically (e.g., N × seed × n_train → 36 jobs). Configurable conda_hook + gpu_free_threshold_mib for non-standard environments. Use for ≥10 jobs; /run-experiment stays for ad-hoc
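Grid expansion of this kind is a Cartesian product over the declared axes. The sketch below is illustrative only: `expand_grid` is a hypothetical name, and the axis values are made up to give a 3 × 3 × 4 grid; the real spec format used by /experiment-queue is not shown in this entry.

```python
from itertools import product


def expand_grid(grid: dict) -> list:
    """Expand {axis: [values]} into one config dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


# Hypothetical axis values: 3 x 3 x 4 = 36 jobs.
jobs = expand_grid({
    "N": [64, 128, 256],
    "seed": [0, 1, 2],
    "n_train": [1000, 2000, 4000, 8000],
})
print(len(jobs))  # 36
```

Each resulting dict is one job config, so a scheduler can persist the list as JSON state and resume from it after a crash.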
2026-04-15 - Paper Writing Pipeline Hardening: 10 empirically-motivated patches from a real NeurIPS run. REVIEWER_BIAS_GUARD=true: every review round uses a fresh thread (codex-reply inflated 3→8/10). Reviewer Independence Protocol: no fix summaries to reviewer. Step 4.5 Restatement Regression Test: catches theorem drift across fix rounds. Step 5.5 Kill Argument Exercise: final-round adversarial attack/defense for theory papers. Location-aware overfull blocking. Theory Paper Consistency Pass in /paper-write. Enforced Bib Hygiene with DBLP/CrossRef validation. Phase 5.5 Mandatory Final Claim Audit as submission gate. Review Tracing Protocol: full prompt/response pairs saved to .aris/traces/ for reviewer-independence audit (review-tracing.md, save_trace.sh). Inspired by community contribution from @ๆๅฒ้พ
2026-04-15 - FigureSpec Renderer v2: deterministic JSON→SVG figure generation for academic papers. Shape-aware edge clipping (rect/circle/ellipse/diamond), self-loops, curved edges, multi-line labels with CJK width estimation, comprehensive validation (type checks, structure, palette). Went through 5 rounds of Codex review (3/10→7/10). All architecture and workflow diagrams in the ARIS tech report were generated with this pipeline. New --- mode: vector for /paper-illustration skill
2026-04-14 - /paper-claim-audit: zero-context paper-to-evidence verification. Fresh reviewer with NO prior context compares every number in the paper against raw result files. Catches rounding inflation, best-seed cherry-pick, config mismatch, delta errors, scope overclaim. Auto-integrated into Workflow 3 (Phase 4.7). Completes the 3-layer audit chain: /experiment-audit (code) → /result-to-claim (science) → /paper-claim-audit (reporting). Visual PDF review also added to improvement loop: the reviewer now sees the compiled PDF, not just LaTeX source. Inspired by Hermes Agent
2026-04-13 - GPT-5.4 Pro via Oracle: — reviewer: oracle-pro on any skill for the strongest available reviewer. API mode (fast) or browser mode (free). Supported on: /research-review, /auto-review-loop, /experiment-audit, /proof-checker, /rebuttal, /idea-creator, /research-lit. Default stays Codex xhigh. Not installed = zero impact. Setup →
2026-04-13 - /proof-checker: rigorous mathematical proof verification via cross-model review. 20-category issue taxonomy, two-axis severity, side-condition checklists (DCT/MCT/Fubini/IFT/...), counterexample red team, proof-obligation ledger. Auto-integrated into Workflow 3: detects \begin{theorem} and runs before improvement loop. Complements /proof-writer
2026-04-10 - ⚡ Effort Levels: — effort: lite | balanced | max | beast. Controls work intensity across all skills: papers found, ideas generated, review rounds, writing depth. Codex reasoning stays xhigh always. beast = every knob to maximum for top-venue sprints. Default balanced = zero change for existing users. Details →
2026-04-10 - DeepXiv integration: progressive paper retrieval via DeepXiv CLI. Opt-in: — sources: deepxiv or — sources: all, deepxiv. Staged reading: search → brief → head → section. pip install deepxiv-sdk to enable. Community contribution by @DreamEnding
2026-04-10 - /experiment-audit: cross-model experiment integrity verification. GPT-5.4 reads your eval scripts and results directly, checks for fake ground truth, self-normalized scores, phantom results, and scope inflation (#131, #57). Advisory: warns loudly, never blocks. /result-to-claim auto-reads audit if present. New experiment-integrity.md shared reference. The executor must never judge its own integrity.
2026-04-10 - tools/smart_update.sh: intelligent skill updater. Compares local vs upstream, detects personal customizations (server paths, API keys), only updates safe skills. bash tools/smart_update.sh --apply
2026-04-10 โ ๐ Community paper: UAV-CC โ first community paper with full PDF archived. UAV change captioning benchmark for IEEE TGRS by @wxx827. Stack: Claude Opus 4.6 + Codex 5.4 xhigh + Cursor. Papers now archived in community_papers/
2026-04-08 - /research-wiki: persistent research knowledge base inspired by Karpathy's LLM Wiki. Accumulates papers, ideas, experiments, and claims across the entire research lifecycle with typed relationships. Wiki-aware hooks in /research-lit (ingest papers), /idea-creator (read wiki + write ideas back), and /result-to-claim (update claim status + trigger re-ideation). Failed ideas become anti-repetition memory. ARIS now learns from its mistakes.
2026-04-05 - /meta-optimize: outer-loop harness optimization for ARIS. Passively logs skill invocations, tool calls, failures, and parameter overrides via Claude Code hooks. Run /meta-optimize to analyze accumulated usage data and propose SKILL.md improvements (reviewer-gated, user-approved). Inspired by Meta-Harness (Lee et al., 2026). ARIS now optimizes itself.
2026-04-04 - Codex Plugin deep integration: /codex:rescue is now auto-invoked when experiments fail (Workflow 1.5) or LaTeX won't compile (Workflow 3). GPT independently diagnoses the bug before Claude retries; two AI debuggers are better than one. Optional: codex exec powers nightmare review, /codex:rescue powers auto-debug. Setup →
2026-04-03 - Modal serverless GPU: no GPU? Set gpu: modal in CLAUDE.md, one command (modal run launcher.py), no SSH, no Docker, auto scale-to-zero. $30/month free tier, enough to try ARIS experiments without any hardware. pip install modal && modal setup and go. Community contribution by @zeyuzhangzyz
2026-04-03 - Reviewer Difficulty Levels: medium (default, unchanged), hard (reviewer memory + debate protocol), nightmare (GPT reads the repo directly via codex exec, so Claude can't hide anything). Use — difficulty: nightmare for a maximum stress test before submission
2026-03-30 - Auto-debug & exhaust-before-surrender: experiment-bridge auto-diagnoses failures (OOM, import, CUDA, NaN) and retries up to 3×. Inspired by PUA
2026-03-16 - research-refine + experiment-plan: turn vague ideas into problem-anchored proposals with claim-driven experiment roadmaps. Now integrated into Workflow 1 (/idea-discovery). Community contribution by @zjYao36
2026-03-16 - Alibaba Coding Plan guide: one API key, 4 models (Kimi-K2.5 + Qwen3.5+ + GLM-5 + MiniMax-M2.7), dual-endpoint setup. Community contribution by @tianhao909
2026-03-15 - Bring your own model! Any OpenAI-compatible API now works as reviewer via the llm-chat MCP server. GLM, MiniMax, Kimi, LongCat, and DeepSeek are all tested; no Claude or OpenAI API needed
2026-03-15 - proof-writer: community skill for rigorous theorem proof drafting. Anti-hallucination citations: /paper-write now fetches real BibTeX from DBLP/CrossRef instead of LLM-generated entries (on by default, zero install)
2026-03-14 - Feishu/Lark integration: three modes (off/push/interactive), mobile notifications for experiments, reviews, and checkpoints
2026-03-13 - Human-in-the-loop: configurable AUTO_PROCEED checkpoints across all workflows. Full autopilot or step-by-step approval
2026-03-12 - Zotero + Obsidian + local PDFs + arXiv/Scholar: multi-source literature search with cross-model novelty verification
2026-03-12 - Three end-to-end workflows complete: one prompt → top-venue-style paper. /research-pipeline chains idea discovery → auto review → paper writing autonomously
# 1. Install skills: project-local symlinks (recommended)
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.git
bash Auto-claude-code-research-in-sleep/tools/install_aris.sh ~/your-project # symlinks ARIS skills into /.claude/skills/
# (prefer a global install instead? cp -r Auto-claude-code-research-in-sleep/skills/* ~/.claude/skills/)
# 1b. Update later (when upstream changes)
cd Auto-claude-code-research-in-sleep && git pull
bash tools/smart_update.sh --apply # updates safe skills, flags your personal customizations
# Optional Codex mirror managed project install
bash tools/install_aris_codex.sh ~/your-codex-project
# Managed Codex project update
cd Auto-claude-code-research-in-sleep && git pull
bash tools/install_aris_codex.sh ~/your-codex-project --reconcile
# Copied Codex installs only (not for projects installed by install_aris_codex.sh)
bash tools/smart_update_codex.sh --local ~/.codex/skills
bash tools/smart_update_codex.sh --local ~/.codex/skills --apply
# 2. Set up Codex MCP (for review skills)
npm install -g @openai/codex
codex setup # set model to gpt-5.5 when prompted
claude mcp add codex -s user -- codex mcp-server
# 3. Use in Claude Code
claude
> /idea-discovery "your research direction"                # Workflow 1: be specific! Not "NLP" but "factorized gap in discrete diffusion LMs"
> /experiment-bridge                                       # Workflow 1.5: have a plan? implement + deploy + collect results
> /auto-review-loop "your paper topic or scope"            # Workflow 2: review → fix → re-review overnight
> /paper-writing "NARRATIVE_REPORT.md"                     # Workflow 3: narrative → polished PDF
> /rebuttal "paper/ + reviews" — venue: ICML               # Workflow 4: parse reviews → draft rebuttal → follow-up
> /resubmit-pipeline "paper/" — venue: NeurIPS             # Workflow 5: port a polished paper to a new venue (text-only, no new experiments)
> /paper-talk "paper/" — venue: ICLR                       # Workflow 6: paper → Beamer + PPTX talk + speaker notes + assurance audits
> /research-pipeline "your research direction"             # Full pipeline: Workflow 1 → 1.5 → 2 → 3 end-to-end
> /research-wiki init                                      # Enable persistent research memory (one-time)
> /meta-optimize                                           # Meta: analyze usage logs → propose skill improvements
Research Wiki (optional): one-line init for persistent memory across sessions; see the full Research Wiki section
Give ARIS persistent memory across sessions. Papers, ideas, failed experiments: nothing is forgotten.
# In Claude Code:
> /research-wiki init # creates research-wiki/ in your project
# That's it. From now on, /research-lit auto-ingests papers, /idea-creator reads
# the wiki before brainstorming (and writes ideas back), /result-to-claim updates
# claim status. Failed ideas become anti-repetition memory for future ideation.
Meta-Optimization (optional): passive usage logging + /meta-optimize for data-driven SKILL.md improvements; see the full Workflow M section
Run these in your normal terminal (not inside Claude Code) to enable passive usage logging:
# One-time setup in your project directory
mkdir -p .claude .aris/meta tools/meta_opt
cp Auto-claude-code-research-in-sleep/templates/claude-hooks/meta_logging.json .claude/settings.json
cp Auto-claude-code-research-in-sleep/tools/meta_opt/*.sh tools/meta_opt/
chmod +x tools/meta_opt/*.sh
# Then start Claude Code; hooks are active immediately
claude
Events are logged to both project-level (.aris/meta/events.jsonl) and global (~/.aris/meta/events.jsonl) logs. After 5+ workflow runs, run /meta-optimize to see data-driven improvement proposals. Use /meta-optimize --global to analyze trends across all your projects.
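As a rough illustration of what analyzing such a log might involve, the sketch below counts events per skill in a JSONL file. The `skill` field name and the `count_events` helper are assumptions made for illustration; the actual event schema written by the hooks may differ.

```python
import json
from collections import Counter
from pathlib import Path


def count_events(log_path, key="skill"):
    """Count events in a JSONL log, grouped by one field (field name is an assumption)."""
    counts = Counter()
    path = Path(log_path)
    if not path.exists():
        return counts  # no log yet: nothing to analyze
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or corrupt lines
        if isinstance(event, dict):
            counts[event.get(key, "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Project-level log path taken from the text above
    for name, total in count_events(".aris/meta/events.jsonl").most_common():
        print(f"{total:5d}  {name}")
```

Pointing the same function at ~/.aris/meta/events.jsonl (expanded with `Path.home()`) would give the cross-project view.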
Templates + DeepXiv + Exa + Uninstall: input templates, two extra literature sources, and the uninstall command
Then use /exa-search directly or opt into it from /research-lit with — sources: exa or — sources: all, exa. Covers blogs, docs, news, and research papers with built-in content extraction.
Uninstall: To remove ARIS skills without affecting your own personal skills:
cd Auto-claude-code-research-in-sleep && ls skills/ | xargs -I{} rm -rf ~/.claude/skills/{}
Show all 16 inline parameters and 12 override examples: AUTO_PROCEED / sources / arxiv download / DBLP_BIBTEX / code review / wandb / illustration / venue / base repo / gpu / compact / ref paper / effort / reviewer / difficulty (full per-skill defaults live in § Customization)
All pipeline behaviors are configurable via inline overrides: append — key: value to any command:
| Parameter | Default | What it does |
|---|---|---|
| AUTO_PROCEED | true | Auto-continue at the idea selection gate. Set false to manually pick which idea to pursue before committing GPU time |
| human checkpoint | false | Pause after each review round so you can read the score, give custom modification instructions, skip specific fixes, or stop early |
| sources | all | Which literature sources to search: zotero, obsidian, local, web, semantic-scholar, deepxiv, exa, or all. Note: semantic-scholar, deepxiv, and exa must be explicitly listed; they are not included in all |
| arxiv download | false | Download top relevant arXiv PDFs during literature survey. When false, only fetches metadata (title, abstract, authors) |
| DBLP_BIBTEX | true | Fetch real BibTeX from DBLP/CrossRef instead of LLM-generated entries. Eliminates hallucinated citations. Zero install |
| code review | true | GPT-5.5 xhigh reviews experiment code before GPU deployment. Set false to skip |
| wandb | false | Auto-add W&B logging to experiment scripts. Set true + configure wandb_project in CLAUDE.md. /monitor-experiment pulls training curves from W&B |
| illustration | gemini | AI illustration in Workflow 3: gemini (default, needs GEMINI_API_KEY), mermaid (free), or false (skip) |
| base repo | | GitHub repo URL to clone as base codebase (e.g., — base repo: https://github.com/org/project). No code? Build on top of an open-source project |
| gpu | local | GPU target: local (default), remote (SSH server), or vast (rent on-demand from Vast.ai with auto-provision and auto-destroy) |
| compact | false | Generate compact summary files (IDEA_CANDIDATES.md, findings.md, EXPERIMENT_LOG.md) for short-context models and session recovery |
| ref paper | false | Reference paper to build on (PDF path or arXiv URL). Summarized first, then ideas extend/improve it. Combine with base repo for paper+code workflows |
| effort | balanced | Work intensity: lite (0.4x tokens), balanced (default), max (2.5x), beast (5-8x). Controls breadth/depth/iterations. Codex reasoning always xhigh. See Effort Levels |
| reviewer | codex | Reviewer backend: codex (GPT-5.5 xhigh, default), oracle-pro (GPT-5.5 Pro via Oracle, strongest reasoning). See Setup → |
| difficulty | medium | Reviewer adversarial level: medium (default), hard (+ memory + debate), nightmare (+ GPT reads repo via codex exec) |
/research-pipeline "your topic" — AUTO_PROCEED: false                          # pause at idea selection gate
/research-pipeline "your topic" — human checkpoint: true                       # pause after each review round to give feedback
/research-pipeline "your topic" — sources: zotero, web                         # only search Zotero + web (skip local PDFs)
/research-pipeline "your topic" — sources: all, deepxiv                        # default sources plus DeepXiv progressive retrieval
/research-pipeline "your topic" — sources: all, exa                            # default sources plus Exa AI-powered web search
/research-pipeline "your topic" — arxiv download: true                         # download top arXiv PDFs during literature survey
/research-pipeline "your topic" — difficulty: nightmare                        # maximum adversarial review before submission
/research-pipeline "your topic" — effort: beast                                # all knobs to maximum for a top-venue sprint
/research-pipeline "your topic" — effort: beast, reviewer: oracle-pro          # beast + GPT-5.5 Pro reviewer (ultimate mode)
/research-pipeline "your topic" — effort: lite                                 # quick exploration, save tokens
/research-pipeline "your topic" — effort: max, review_rounds: 3                # max effort but cap review at 3 rounds
/research-pipeline "your topic" — AUTO_PROCEED: false, human checkpoint: true  # combine options
/proof-checker "paper/" — reviewer: oracle-pro                                 # Pro-level proof verification
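The override syntax above can be read as a base command followed by comma-separated key: value pairs, where a piece without a key continues the previous value (as in sources: all, deepxiv). The sketch below illustrates that reading only; it is not how ARIS itself handles commands, since skills are interpreted by the executor model. `parse_overrides` and `resolve_sources` are hypothetical helpers, and the expansion of all to zotero, obsidian, local, web is inferred from the sources description, which says the other three must be listed explicitly.

```python
import re

SEPARATOR = "\u2014"  # the dash that introduces inline overrides
# A key is letters/digits/underscores/spaces followed by a colon and whitespace,
# so a URL such as "https://..." is never mistaken for a key.
KEY_VALUE = re.compile(r"^([A-Za-z_][A-Za-z0-9_ ]*?)\s*:\s+(.+)$")

# Assumption: "all" expands to the sources that do not require explicit listing.
DEFAULT_SOURCES = ["zotero", "obsidian", "local", "web"]


def parse_overrides(command):
    """Split a command into (base_command, {key: value}). Illustrative only."""
    base, sep, tail = command.partition(SEPARATOR)
    overrides = {}
    last_key = None
    if sep:
        for piece in tail.split(","):
            piece = piece.strip()
            if not piece:
                continue
            match = KEY_VALUE.match(piece)
            if match:
                last_key = match.group(1).strip()
                overrides[last_key] = match.group(2).strip()
            elif last_key is not None:
                # No "key: " prefix: treat it as another item of the previous value
                overrides[last_key] += ", " + piece
    return base.strip(), overrides


def resolve_sources(value):
    """Expand a sources value into an ordered list without duplicates."""
    resolved = []
    for name in (item.strip() for item in value.split(",")):
        if not name:
            continue
        for source in (DEFAULT_SOURCES if name == "all" else [name]):
            if source not in resolved:
                resolved.append(source)
    return resolved


if __name__ == "__main__":
    cmd = '/research-pipeline "your topic" ' + SEPARATOR + " sources: all, deepxiv, effort: lite"
    base, opts = parse_overrides(cmd)
    print(base, opts, resolve_sources(opts["sources"]))
```

One limitation of this sketch: a topic string that itself contains the separator dash would be split in the wrong place.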
Codex MCP config + alternative reviewer routing: pin the model in ~/.codex/config.toml; pointers to Codex+Claude-review, Codex+Gemini-review, and the Codex mirror install chain
Important: Codex MCP uses the model from ~/.codex/config.toml, not from skill files. Make sure it says model = "gpt-5.5" (recommended). Other options: gpt-5.3-codex, gpt-5.2-codex, o3. Run codex setup or edit the file directly.
Want Codex to execute but Claude Code to review? See docs/CODEX_CLAUDE_REVIEW_GUIDE.md. That path installs the base skills/skills-codex/*, then overlays skills/skills-codex-claude-review/*, and routes review-heavy skills through the local claude-review MCP bridge.
Want Codex to execute but Gemini to review locally? See docs/CODEX_GEMINI_REVIEW_GUIDE.md and CN. That path installs the base skills/skills-codex/*, then overlays skills/skills-codex-gemini-review/*, and routes the reviewer-aware predefined skills through the local gemini-review MCP bridge using direct Gemini API by default.
Want the Codex mirror install chain? Use tools/install_aris_codex.sh for managed project installs and tools/smart_update_codex.sh for copied Codex installs. The Claude scripts remain the mainline entry points for Claude projects.
ARIS chains 79 composable skills across the whole research lifecycle (literature & novelty → idea discovery → GPU experiments → autonomous review loop → paper writing → peer review), with cross-model adversarial review (Claude executes · GPT-5.5 xhigh reviews · optional GPT-5.5 Pro via Oracle), anti-hallucination DBLP/CrossRef citations, a persistent Research Wiki, flexible model backends, human-in-the-loop checkpoints, and optional Feishu / Zotero / Obsidian / GPU integrations.
It also scales to any agent's ultracode-style deep mode: the breadth/firepower pass adapts to the runtime (Claude Code ultracode + workflows on Opus 4.8, Codex spawn_agent, or plain sequential), feeding three roles: breadth · cross-model review → accuracy · research wiki → memory. However a loop is driven, it reports to the same cross-model jury + research wiki: it can drive, never acquit.
Full feature list
79 composable skills: mix and match, or chain into full pipelines (/idea-discovery, /auto-review-loop, /paper-writing, /research-pipeline). See full catalog →
Literature & novelty: multi-source paper search (Zotero + Obsidian + local PDFs + arXiv/Scholar) + cross-model novelty verification
Idea discovery: literature survey → brainstorm 8-12 ideas → novelty check → GPU pilot experiments → ranked report
Auto review loop: 4-round autonomous review, 5/10 → 7.5/10 overnight with 20+ GPU experiments
Paper writing: narrative → outline → figures → LaTeX → PDF → auto-review (4/10 → 8.5/10), one command. Anti-hallucination citations via DBLP/CrossRef
Cross-model collaboration: Claude Code executes, GPT-5.5 xhigh reviews. Adversarial, not self-play. Optional: — reviewer: oracle-pro → GPT-5.5 Pro via Oracle
Peer review: review others' papers as a conference reviewer, with structured scoring and meta-review
Review-driven experiments: when GPT-5.5 says "run an ablation", Claude auto-writes the script, rsyncs to GPU, runs in screen, collects results, and folds them back into the paper. Configure the server in CLAUDE.md (setup), or rent from Vast.ai with gpu: vast
Human-in-the-loop: configurable checkpoints at key decisions. AUTO_PROCEED=true for full autopilot, false to approve each step
Feishu/Lark notifications: three modes: off (default, recommended), push-only (webhook → mobile), interactive (approve/reject in Feishu). Zero impact when off
Push Only: group chat cards (experiment done, checkpoint, error, pipeline complete):
Interactive: private chat with Claude Code (approve/reject, custom instructions):
Research Wiki: persistent knowledge base across papers/ideas/experiments/claims. Failed ideas become anti-repetition memory, so ARIS gets smarter every run. Inspired by Karpathy's LLM Wiki
Extensible: domain-specific skills welcome! Add a SKILL.md and open a PR. See community skills like dse-loop (architecture/EDA)
ARIS ships 79+ skills across literature, ideation, experiments, audit, writing, talks, patents, and meta-utilities. The full catalog (role / category / requirements per skill) lives in docs/SKILLS_CATALOG.md to keep this README scannable.
Start here: common entry points (use case → skill)
A real overnight 4-round run on an ML research project: the AI reviewer's score climbed from 5.0/10 (borderline reject) to 7.5/10 (review-ready) as the loop autonomously ran 20+ GPU experiments, rewrote the narrative framing, and killed claims that didn't hold up, all without human intervention.
Round-by-round breakdown
| Round | Score | What Happened |
|---|---|---|
| Initial | 5.0/10 | Borderline reject |
| Round 1 | 6.5/10 | Added standard metrics, discovered metric decoupling |
| Round 2 | 6.8/10 | Key claim failed to reproduce, pivoted narrative |
| Round 3 | 7.0/10 | Large seed study killed main improvement claim |
| Round 4 | 7.5/10 | Diagnostic evidence solidified, submission ready |
6. Community Showcase: Papers Built with ARIS
Real projects that used the full ARIS pipeline end-to-end. The scores listed are AI-review signals (CSPaper / Stanford Agentic Reviewer), not venue acceptances. Since ARIS optimizes through AI-review loops, high AI scores are an expected byproduct, not proof of acceptance (human reviewers still bring literature / venue / community judgment an AI reviewer misses). Used ARIS for a paper? Open an issue / PR to be featured!
Papers + their AI-review signals (3)
Paper
AI-review signal
Submission status
Built by
Notes
CS Paper Submission
CSPaper 8/10. AI reviewer recommendation: "Top 50% of accepted papers, clear accept"
Submitted to a CS conference; awaiting official feedback
Full ARIS pipeline: idea → experiments → auto-review → paper writing. The quote is from CSPaper's simulated review, not an official venue review.
UAV change captioning benchmark. Claude Opus 4.6 (executor) + Codex GPT-5.5 xhigh (reviewer) + Cursor Opus 4.6 (assist). PDF →
Reviewer screenshots
7. Awesome Community Skills & Extensions
Domain-specific skills and external projects contributed by the community. PRs welcome: just add a skills/your-skill/SKILL.md and open a PR!
> How to use: Community skills are not auto-wired into core workflows. To use one, ask your executor (Claude Code / OpenClaw / etc.) to read the skill's SKILL.md, then plug it into the appropriate workflow stage based on the description below.
> Thanks to every contributor! We fold the tables below to keep the README readable, but every skill and project here is equally valued. PRs always welcome!
DEPRECATED: redirect stub to the core /paper-poster-html (measurement-gated HTML/CSS pipeline); the legacy LaTeX implementation lives in git history
Programmatic access to ChatGPT Pro / gpt-5.5-pro / DeepResearch from Node, via Chrome CDP Fetch interception + WebSocket second-leg streaming. Ships an MCP server for Claude Code / Codex / Cline: an alternative implementation path to Oracle MCP for — reviewer: oracle-pro style high-tier review. Supports multi-turn, parallel concurrency, live token deltas, and a 15-min idle-timeout watchdog (long Pro thinks survive). MIT, by @SyntaxSmith
Convert research papers (PDF/LaTeX) into interactive six-module HTML courses with formula breakdowns, literature timelines, quizzes, and glossary tooltips, as a single bundled file with no server needed
Official MiniMax CLI: text, image, video, speech, and music generation + web search. skill/SKILL.md follows the agentskills.io standard. Drop-in companion for the Alt B (MiniMax reviewer) setup
Academic conference posters as a single HTML/CSS file, with print-ready PDF via headless Chromium (no LaTeX). A Claude Code skill; its gate machinery now powers ARIS's default /paper-poster-html. By @Chenruishuo
Local read-only dashboard for many concurrent Claude Code / Codex windows: triage (working / waiting-on-you / done), one-click Focus, ~50ms full-text search across transcripts, skill/memory analytics. By @tianyilt
8. Workflows
These skills compose into a full research lifecycle. Each workflow can be used independently or chained together:
Exploring a new area (e.g., writing a survey)? Start with Workflow 1 → /idea-discovery
Have a plan, need to implement and run? Workflow 1.5 → /experiment-bridge
Already have results, need iterative improvement? Workflow 2 → /auto-review-loop
Ready to write the paper? Workflow 3 → /paper-writing (or step by step: /paper-plan → /paper-figure → /paper-write → /paper-compile → /auto-paper-improvement-loop)
Got reviews back and need to write a rebuttal? Workflow 4 → /rebuttal: parse reviews, draft a safe rebuttal, handle follow-up rounds
Full pipeline? Workflow 1 → 1.5 → 2 → 3 → submit → 4 → /research-pipeline + /rebuttal, from idea through submission and rebuttal
Want ARIS to remember and learn? /research-wiki init → persistent memory across sessions. Papers, ideas, and failed experiments compound over time
Want ARIS to improve itself? Workflow M → /meta-optimize: analyze usage logs, propose skill improvements, reviewer-gated
> Important: These tools accelerate research, but they don't replace your own critical thinking. Always review generated ideas with your domain expertise, question the assumptions, and make the final call yourself. The best research comes from human insight + AI execution, not full autopilot.
Workflow 1: Idea Discovery & Method Refinement
> "What's the state of the art? Where are the gaps? How do we solve it?"
Don't have a concrete idea yet? Just give a research direction and /idea-discovery handles the rest:
Survey the landscape (recent papers, open problems, recurring limitations)
Brainstorm 8-12 concrete ideas via GPT-5.5 xhigh
Filter by feasibility, compute cost, and quick novelty search
Validate top ideas with deep novelty check + devil's advocate review
Pilot top 2-3 ideas in parallel on different GPUs (30 min - 2 hr each)
Rank by empirical signal: ideas with positive pilot results rise to the top
Refine the top idea into a problem-anchored proposal via iterative GPT-5.5 review
Plan claim-driven experiments with ablations, budgets, and run order
The output is a ranked IDEA_REPORT.md plus a refined proposal (refine-logs/FINAL_PROPOSAL.md) and experiment plan (refine-logs/EXPERIMENT_PLAN.md) for the top idea. Dead-end ideas are documented too, saving future exploration.
Show W1 flow diagram and example command sequence: research-lit → idea-creator → novelty-check → research-refine → experiment-plan
> One-command shortcut: /idea-discovery "your research direction" runs this entire workflow automatically.
> Human-in-the-loop: Each phase presents results and waits for your feedback. Not happy? Tell it what's missing; it refines the prompt and regenerates. Trust the defaults? It auto-proceeds with the top-ranked option. You decide how hands-on to be.
> Pilot experiment budgets (max hours, timeout, GPU budget) are configurable; see Customization.
> One-command shortcut: /experiment-bridge reads refine-logs/EXPERIMENT_PLAN.md automatically. Or point it to any plan: /experiment-bridge "my_plan.md".
> CODE_REVIEW, AUTO_DEPLOY, SANITY_FIRST, MAX_PARALLEL_RUNS are configurable; see Customization.
Workflow 2: Auto Research Loop (sleep & wake up to results)
> "Review my paper, fix what's wrong, repeat until it's good."
>
> GPT-5.5 reviews → identifies weaknesses → suggests experiments → Claude Code writes scripts, deploys to GPU, monitors results, rewrites the paper, all while you sleep. Just add your GPU server config to CLAUDE.md.
Deep review: GPT-5.5 xhigh reviews the current paper / claims / experiments and identifies weaknesses
Fix: Claude implements the fixes (rewrites sections, adds baselines, or runs new experiments via /run-experiment); skips any experiment estimated > 4 GPU-hours and flags it for manual follow-up
Re-evaluate: collect results via /monitor-experiment, update the paper, feed back to the reviewer
Repeat: until score ≥ POSITIVE_THRESHOLD (default 6/10) or MAX_ROUNDS (default 4) is hit; if the context window fills mid-loop, the workflow auto-resumes from REVIEW_STATE.json
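The stopping rule and recovery behavior described above can be sketched as a small control loop. This is an illustration under stated assumptions, not the skill's implementation: the `review` and `apply_fixes` callables, the `gpu_hours` and `name` fields, and the layout of the state file are hypothetical. Only the defaults (threshold 6/10, 4 rounds, 4 GPU-hour cap) and the REVIEW_STATE.json file name come from the text.

```python
import json
from pathlib import Path

POSITIVE_THRESHOLD = 6.0  # default score threshold from the text
MAX_ROUNDS = 4            # default round cap from the text
MAX_GPU_HOURS = 4.0       # experiments above this estimate are skipped


def run_review_loop(review, apply_fixes, state_path="REVIEW_STATE.json"):
    """Review, fix, repeat; persist state after every round so a restart can resume."""
    path = Path(state_path)
    if path.exists():
        # Resume from the last completed round
        state = json.loads(path.read_text(encoding="utf-8"))
    else:
        state = {"round": 0, "scores": [], "skipped": [], "done": False}

    while not state["done"] and state["round"] < MAX_ROUNDS:
        round_no = state["round"] + 1
        score, fixes = review(round_no)  # hypothetical: returns (score, list of fix dicts)
        state["round"] = round_no
        state["scores"].append(score)
        if score >= POSITIVE_THRESHOLD:
            state["done"] = True
        else:
            runnable = []
            for fix in fixes:
                if fix.get("gpu_hours", 0.0) > MAX_GPU_HOURS:
                    state["skipped"].append(fix["name"])  # flag for manual follow-up
                else:
                    runnable.append(fix)
            apply_fixes(runnable)
        path.write_text(json.dumps(state), encoding="utf-8")
    return state
```

Writing the state file at the end of each round is what makes resumption cheap: a restarted run picks up at the next round instead of repeating earlier reviews.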
Show W2 loop diagram: external review → implement fixes / run experiments → monitor results → repeat until threshold
> One-command shortcut: /auto-review-loop "your paper topic" runs this entire workflow automatically.
Show W2 usage examples, reviewer difficulty levels, and full safety guarantees: topic/scope arguments, medium/hard/nightmare, 6 safety rules
What to pass as argument? A short topic or scope is enough: the skill automatically reads your project's narrative docs (NARRATIVE_REPORT.md), memory files, experiment results, and prior reviews to build the full context for GPT-5.5. Examples:
/auto-review-loop "factorized gap in discrete diffusion LMs": broad topic, skill finds everything
/auto-review-loop "focus on Section 3-5, our CRF results are weak": targeted scope with hints
/auto-review-loop with no argument also works: the skill reads project files and infers the topic
Reviewer Difficulty: control how adversarial the reviewer is:
+ GPT reads repo directly via codex exec (Claude can't filter what it sees) + adversarial verification
Preparing for top venue, want maximum stress test
/auto-review-loop "topic" — difficulty: nightmare    # GPT reads your code and verifies claims itself
Key safety features:
MAX_ROUNDS = 4: prevents infinite loops; stops early if the score threshold is met
> 4 GPU-hour experiments skipped: won't launch massive jobs; flags them for manual follow-up
Prefer reframing over new experiments: when both can address a weakness, chooses the cheaper path
No hiding weaknesses: explicit rule: "Do NOT hide weaknesses to game a positive score"
Fix before re-review: must actually implement fixes before resubmitting; no empty promises
Compact recovery: persists state (REVIEW_STATE.json) after each round. If the context window fills up and auto-compacts mid-loop, the workflow reads the state file and resumes from where it left off, with no human intervention needed
> MAX_ROUNDS, score threshold, and GPU limits are configurable; see Customization.
> One-command shortcut: /paper-writing "NARRATIVE_REPORT.md" runs this entire workflow automatically.
Input: A NARRATIVE_REPORT.md describing the research: claims, experiments, results, figures. The more detailed the narrative (especially figure descriptions and quantitative results), the better the output.
Output: A paper/ directory with LaTeX source, clean .bib (only cited entries), and compiled PDF. The PDF is labelled submission-ready only when run at — effort: max | beast (or explicit — assurance: submission) and tools/verify_paper_audits.sh reports green on the three mandatory audits (proof-checker, paper-claim-audit, citation-audit); see Assurance Gate below. At the default balanced level, the output is a reviewed draft.
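The labelling rule in the Output paragraph amounts to a conjunction of two conditions, which the following sketch spells out. The `is_submission_ready` function and the dictionary of audit statuses are hypothetical; how tools/verify_paper_audits.sh actually reports its results is not shown here.

```python
MANDATORY_AUDITS = ("proof-checker", "paper-claim-audit", "citation-audit")
SUBMISSION_EFFORTS = {"max", "beast"}


def is_submission_ready(effort="balanced", assurance=None, audit_status=None):
    """True only if the run was at submission level AND all mandatory audits are green.

    audit_status is a hypothetical mapping such as {"proof-checker": "green", ...}.
    """
    audit_status = audit_status or {}
    submission_mode = effort in SUBMISSION_EFFORTS or assurance == "submission"
    audits_green = all(audit_status.get(name) == "green" for name in MANDATORY_AUDITS)
    return submission_mode and audits_green
```

A missing audit counts as not green in this sketch, so an incomplete audit set can never yield the submission-ready label.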
Show W3 feature details: Claims-Evidence Matrix, figure modes, clean bib, Gemini API setup, ICLR end-to-end test
Key features:
Claims-Evidence Matrix: every claim maps to evidence, every experiment supports a claim
Auto figure generation: line plots, bar charts, comparison tables from JSON data
Clean bib: automated filtering removes uncited entries (948 → 215 lines in testing). Real BibTeX from DBLP/CrossRef instead of LLM-generated entries
Flexible sections: 5-8 sections depending on paper type (theory papers often need 7)
GPT-5.5 review: each step optionally reviewed by an external LLM
Page verification: pdftotext-based precise check that the main body fits the page limit
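To make the clean-bib idea concrete, here is a minimal sketch of filtering a .bib file down to the keys cited in the LaTeX source. It is a simplified stand-in, not the filter ARIS uses: `cited_keys` and `filter_bib` are hypothetical helpers, the regular expressions only cover common \cite-style commands, and special entries such as @string or @preamble are not handled.

```python
import re

# Matches \cite, \citep, \citet, \autocite, ... with optional [..] arguments
CITE = re.compile(r"\\[A-Za-z]*cite[A-Za-z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the start of an entry such as "@article{smith2020,"
ENTRY_START = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)


def cited_keys(tex_source):
    """Collect every citation key used in the LaTeX source."""
    keys = set()
    for group in CITE.findall(tex_source):
        keys.update(key.strip() for key in group.split(",") if key.strip())
    return keys


def filter_bib(bib_source, keys):
    """Keep only the entries whose key is cited; each entry runs to the next entry start."""
    starts = list(ENTRY_START.finditer(bib_source))
    kept = []
    for index, match in enumerate(starts):
        end = starts[index + 1].start() if index + 1 < len(starts) else len(bib_source)
        if match.group(2) in keys:
            kept.append(bib_source[match.start():end].strip())
    return "\n\n".join(kept) + ("\n" if kept else "")
```

In practice the LaTeX source would be the concatenation of all .tex files in paper/, so that citations in any section keep their entry.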
> Figure generation scope: /paper-figure auto-generates data-driven plots (training curves, bar charts, heatmaps) and comparison tables from JSON/CSV. For architecture diagrams and method figures: illustration: gemini (default) uses Claude → Gemini → Nano Banana Pro for publication-quality diagrams; illustration: mermaid generates Mermaid diagrams for free; illustration: false skips AI figures entirely.
>
> Gemini API setup (for illustration: gemini): Get your API key at Google AI Studio, then set it as an environment variable: export GEMINI_API_KEY="your-key". Or add it to your shell profile (~/.zshrc / ~/.bashrc). No other dependencies needed.
Tested end-to-end: Generated a 9-page ICLR 2026 theory paper (7 sections, 29 citations, 4 figures, 2 comparison tables) from a single NARRATIVE_REPORT.md, with zero compilation errors and zero undefined references.
Auto Paper Improvement Loop
After Workflow 3 generates the paper, /auto-paper-improvement-loop runs 2 rounds of GPT-5.5 xhigh content review → fix → recompile, plus a final format compliance check, autonomously polishing the paper from rough draft to a reviewer-scored draft. Whether the result is tagged submission-ready is decided separately by the Phase 6 assurance gate (see Assurance Gate).
Show auto-paper-improvement benchmark: Score Progression on a real ICLR 2026 theory paper (4/10 → 8.5/10), plus Round 1/2/3 fix details
Score Progression (Real Test, ICLR 2026 theory paper):
Final: 8 pages main body (ICLR limit: 9), 0 overfull hbox, ICLR-compliant. +4.5 points across 3 rounds.
Round 1 fixes (6 items)
CRITICAL - Assumption-model mismatch: A boundedness assumption contradicted the model's distributional family. Replaced with a tail-compatible assumption and added a formal truncation bridge.
CRITICAL - Theory-practice gap: Theory assumes idealized encoders, experiments use learned nonlinear encoders. Softened "validate" → "demonstrate practical relevance" and added an explicit disclaimer.
MAJOR - Missing quantitative metrics: Added a parameter count table (latent vs total) with honest accounting of system cost.
MAJOR - Theorem not self-contained: Added an "Interpretation" paragraph listing all dependencies explicitly.
MAJOR - Overclaim in novelty statement: Scoped a broad "first convergence guarantee" to the precise conditions under which it holds.
MAJOR - Notation confusion: Renamed a symbol that clashed with another key variable. Added a Notation paragraph.
Round 2 fixes (4 items)
MAJOR - Missing theory-aligned experiments: Added a synthetic validation subsection directly testing the two main theoretical predictions under controlled conditions.
MAJOR - Overclaim softening: Replaced strong equivalence claims with appropriately hedged language across all files.
MAJOR - Informal theoretical argument: Formalized an informal justification into a proper proposition with explicit error bounds.
MINOR - Weak limitations: Expanded to explicitly list all assumptions and acknowledge missing standard evaluations.
> Quick mode: /rebuttal — quick mode: true stops after parsing + strategy (Phase 0-3). See what reviewers want before committing to a full draft.
> VENUE, AUTO_EXPERIMENT, QUICK_MODE, MAX_STRESS_TEST_ROUNDS are configurable; see Customization.
Three safety gates (the rebuttal will NOT finalize if any fails):
Provenance: every claim maps to a paper/review/user-confirmed result. No fabrication.
Commitment: every promise is user-approved. No overpromising.
Coverage: every reviewer concern is tracked. Nothing disappears.
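One way to picture the three gates is as independent checks that must all pass before finalizing. The sketch below is illustrative only: the data model (claim and promise dictionaries, concern ids, the `failed_gates` function) is hypothetical and is not taken from the /rebuttal skill.

```python
# Allowed provenance sources, following the gate description above
ALLOWED_SOURCES = {"paper", "review", "user-confirmed"}


def failed_gates(claims, promises, concerns, tracked_ids):
    """Return the names of failing gates; an empty list means the rebuttal may finalize."""
    failures = []
    # Provenance: every claim must point to an allowed source
    if any(claim.get("source") not in ALLOWED_SOURCES for claim in claims):
        failures.append("provenance")
    # Commitment: every promise must have been approved by the user
    if any(not promise.get("user_approved", False) for promise in promises):
        failures.append("commitment")
    # Coverage: every reviewer concern must appear in the tracked set
    if any(concern_id not in tracked_ids for concern_id in concerns):
        failures.append("coverage")
    return failures
```

Returning the list of failing gates, rather than a single boolean, makes it clear which rule blocked finalization.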
Workflow 5: Resubmit Pipeline (port a paper to a new venue, text-only)
Port a polished paper from venue A → B under hard, non-overridable guardrails (no new experiments · no bib edits · no framework changes · never overwrites prior submissions) via physical isolation, a 5-layer anonymity check, soft-only audits, whitelist microedits, and a /kill-argument adversarial gate. Full flow + constraints → docs/RESUBMIT_AND_TALK.md
/paper-talk turns an accepted paper into a talk: outline → /paper-slides (Beamer + PPTX + speaker notes + Q&A) → /slides-polish (per-page Codex visual pass) → optional conference-ready audit gate. Sister to /paper-writing / /paper-poster-html. Full flow → docs/RESUBMIT_AND_TALK.md
Research Wiki: Persistent Research Memory
> "Stop re-deriving. Start compounding." Inspired by Karpathy's LLM Wiki
Without the wiki, ARIS is stateless: every /idea-discovery starts from scratch. With the wiki, ARIS accumulates knowledge across the entire research lifecycle: papers read, ideas tested, experiments run, claims verified or invalidated.
The key insight: failed ideas are the most valuable memory. A researcher who knows what doesn't work generates better ideas than one starting from zero.
Setup:
> /research-wiki init # one-time, creates research-wiki/ in your project
That's it. Once initialized, the wiki works automatically.
Show the automatic wiki hooks: what fires at /research-lit, /idea-creator, /result-to-claim, plus the re-ideation nudge
Hypothesis, status (proposed/failed/succeeded), failure notes, lessons
idea:001
๐งช Experiment
Metrics, verdict, hardware, duration
exp:001
๐ Claim
Testable statement + evidence status (reported/supported/invalidated)
claim:C1
Typed relationships (stored in graph/edges.jsonl):
paper --extends--> paper            idea --inspired_by--> paper
paper --contradicts--> paper        idea --tested_by--> experiment
paper --addresses_gap--> gap        experiment --supports--> claim
paper --supersedes--> paper         experiment --invalidates--> claim
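To make the typed-edge idea concrete, here is a small sketch of appending and reading such edges as JSON Lines. The record fields (`src`, `rel`, `dst`) and the `paper:` / `gap:` ID prefixes are assumptions for illustration; only `idea:`, `exp:` and `claim:` IDs and the file name `graph/edges.jsonl` appear in the text above.

```python
import json
from pathlib import Path

# The eight relationships listed above, as (source type, relation, target type).
ALLOWED_EDGES = {
    ("paper", "extends", "paper"),
    ("paper", "contradicts", "paper"),
    ("paper", "addresses_gap", "gap"),
    ("paper", "supersedes", "paper"),
    ("idea", "inspired_by", "paper"),
    ("idea", "tested_by", "experiment"),
    ("experiment", "supports", "claim"),
    ("experiment", "invalidates", "claim"),
}

# ID prefix -> entity type. The "paper" and "gap" prefixes are assumed.
PREFIX_TO_TYPE = {"paper": "paper", "gap": "gap", "idea": "idea",
                  "exp": "experiment", "claim": "claim"}


def entity_type(node_id: str) -> str:
    """Map an ID such as 'exp:001' to its entity type ('experiment')."""
    return PREFIX_TO_TYPE[node_id.split(":", 1)[0]]


def add_edge(edges_file: Path, src: str, rel: str, dst: str) -> dict:
    """Validate the (type, relation, type) triple, then append one JSON line."""
    if (entity_type(src), rel, entity_type(dst)) not in ALLOWED_EDGES:
        raise ValueError(f"relationship not allowed: {src} --{rel}--> {dst}")
    edge = {"src": src, "rel": rel, "dst": dst}  # hypothetical field names
    edges_file.parent.mkdir(parents=True, exist_ok=True)
    with edges_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(edge) + "\n")
    return edge


def load_edges(edges_file: Path) -> list:
    """Read every non-empty line back as a dict, in insertion order."""
    with edges_file.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

An append-only JSONL file suits this kind of memory because a failed idea is recorded as a new edge instead of an in-place rewrite.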
Show Research Wiki spiral-learning example and manual subcommands: failed ideas → better ideas across 3 rounds; ingest / query / update / lint / stats
Spiral learning in action:
Round 1: read 15 papers → wiki remembers → idea A → experiment → FAIL
         wiki records: "A fails because OOM at batch>32, loss diverges"
Round 2: /idea-creator reads wiki → sees A failed → generates idea D (avoids A's trap)
         → experiment → PARTIAL SUCCESS
         wiki records: "D works on small models, fails on large"
Round 3: /idea-creator reads wiki → knows A failed + D partial → generates idea F
         (combines D's success with new approach) → experiment → SUCCESS
Subcommands:
/research-wiki init # initialize wiki
/research-wiki ingest "paper title" — arxiv: xxx # manually add a paper
/research-wiki query "topic" # rebuild query_pack.md
/research-wiki update idea:001 — outcome: negative # update entity
/research-wiki lint # health check (orphans, contradictions, stale claims)
/research-wiki stats # overview (paper/idea/experiment/claim counts)
> Safe by design: all workflow hooks are guarded by an "if research-wiki/ exists" check. No wiki = no impact. Zero dependencies (pure Python stdlib). You choose when to enable it.
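The guard described above amounts to a directory check in front of every hook. A minimal sketch, where `run_wiki_hook` and the hook callables are hypothetical names rather than ARIS functions:

```python
from pathlib import Path
from typing import Callable, Optional


def run_wiki_hook(project_root: Path, hook: Callable[[Path], str]) -> Optional[str]:
    """Run `hook` only when the project has a research-wiki/ directory."""
    wiki_dir = Path(project_root) / "research-wiki"
    if not wiki_dir.is_dir():
        return None  # no wiki: the hook is skipped and the workflow is unaffected
    return hook(wiki_dir)
```

A hook could then be invoked as `run_wiki_hook(Path("."), lambda wiki: f"ingested into {wiki.name}")`, which returns `None` in any project that never ran `/research-wiki init`.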
> "Analyze my usage patterns and improve your own skills."
Unlike Workflows 1-4, which optimize research artifacts (papers, code, experiments), Workflow M optimizes the harness itself: the SKILL.md instructions, default parameters, and convergence rules that govern how ARIS operates. Inspired by Meta-Harness (Lee et al., 2026).
Show Workflow M one-time setup and usage commands: Claude Code hook install, /meta-optimize variants (project / per-skill / --global / apply)
Setup (one-time, in normal terminal):
mkdir -p .claude .aris/meta tools/meta_opt
cp Auto-claude-code-research-in-sleep/templates/claude-hooks/meta_logging.json .claude/settings.json
cp Auto-claude-code-research-in-sleep/tools/meta_opt/*.sh tools/meta_opt/
chmod +x tools/meta_opt/*.sh
claude # hooks active immediately
Usage (after 5+ workflow runs):
> /meta-optimize # analyze current project
> /meta-optimize "auto-review-loop" # focus on one skill
> /meta-optimize --global # analyze trends across ALL projects
> /meta-optimize apply 1 # apply recommended change #1
How it works:
- Passive logging: Claude Code hooks silently record every skill invocation, tool call, failure, parameter override, and user prompt. Events are written to both project-level (.aris/meta/events.jsonl) and global (~/.aris/meta/events.jsonl, with a "project" tag) logs. Zero user effort.
- Pattern analysis: /meta-optimize reads the log and identifies:
  - Parameters users override most often (bad defaults)
  - Tools that fail repeatedly in specific skills (missing error handling)
  - Review score plateaus (convergence rules too loose/tight)
  - Manual corrections users make (skill gaps)
- Patch proposal: generates minimal diffs to target SKILL.md files with data-backed justifications
- Reviewer gate: GPT-5.5 xhigh reviews each patch: does the evidence support it? Could it hurt other users?
- User approval: patches are only applied with explicit user consent. All changes are logged and reversible.
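As an illustration of the pattern-analysis step, the sketch below counts parameter overrides and tool failures in an event log. The event schema (`type`, `skill`, `param`, `tool`) and the `min_count` threshold are assumptions; the source only states that events are stored as JSON Lines in events.jsonl.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def summarize_events(lines: Iterable[str], min_count: int = 3) -> Dict[str, List[Tuple[str, str]]]:
    """Flag (skill, param) and (skill, tool) pairs that recur at least `min_count` times."""
    overrides: Counter = Counter()
    failures: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Hypothetical event types; a real log may use different names.
        if event.get("type") == "param_override":
            overrides[(event["skill"], event["param"])] += 1
        elif event.get("type") == "tool_failure":
            failures[(event["skill"], event["tool"])] += 1
    return {
        "bad_defaults": [key for key, n in overrides.most_common() if n >= min_count],
        "flaky_tools": [key for key, n in failures.most_common() if n >= min_count],
    }
```

The function only surfaces candidates. In the workflow described above, each candidate would still pass through patch proposal, the reviewer gate, and user approval before anything changes.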
Show Workflow M diagram and "what gets optimized" component table: event logs → SKILL.md patches → GPT-5.5 review → user approval; prompts / defaults / convergence / error handling
What does NOT get optimized: research artifacts (papers, code, experiments). That's what W1-W4 do.
Skills involved: meta-optimize
> This is a maintenance workflow, not part of the W1→W1.5→W2→W3→W4 research pipeline. Run it periodically, like git gc for your research harness.
Effort Levels
Every skill takes — effort: lite | balanced | max | beast, which scales breadth/depth (papers · ideas · pilots · rounds · seeds · audit depth) from ~0.4× to ~5-8×; balanced is the default (zero change for existing users). What never changes at any level: Codex reasoning stays xhigh, DBLP/CrossRef citations on, reviewer independence on, experiment integrity on. Full spec + per-skill counts → effort-contract.md
Assurance Gate (effort: max | beast)
A second axis, orthogonal to effort: assurance decides whether mandatory audits are load-bearing. lite/balanced → draft (audits non-blocking: current behavior, zero change); max/beast → submission (paper-writing Phase 6 force-runs /proof-checker + /paper-claim-audit + /citation-audit in fresh threads and refuses the Final Report if tools/verify_paper_audits.sh exits non-zero). Escape hatch: — effort: beast, assurance: draft. Full spec → assurance-contract.md
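The effort-to-assurance mapping and its escape hatch can be sketched as a small resolver. The function names are hypothetical; the mapping itself and the "non-zero exit blocks the Final Report" rule are taken from the paragraph above.

```python
from typing import Optional

# Default assurance level implied by each effort level.
DEFAULT_ASSURANCE = {
    "lite": "draft",
    "balanced": "draft",
    "max": "submission",
    "beast": "submission",
}


def resolve_assurance(effort: str = "balanced", assurance: Optional[str] = None) -> str:
    """An explicit assurance value wins; otherwise the effort level decides."""
    if effort not in DEFAULT_ASSURANCE:
        raise ValueError(f"unknown effort level: {effort}")
    if assurance is None:
        return DEFAULT_ASSURANCE[effort]
    if assurance not in ("draft", "submission"):
        raise ValueError(f"unknown assurance level: {assurance}")
    return assurance


def may_emit_final_report(effort: str, audit_exit_code: int, assurance: Optional[str] = None) -> bool:
    """Audits block the report only at 'submission' assurance."""
    if resolve_assurance(effort, assurance) == "draft":
        return True
    return audit_exit_code == 0
```

For example, `may_emit_final_report("beast", 1)` is refused, while `may_emit_final_report("beast", 1, assurance="draft")` goes through, which is the escape hatch in action.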
Optional: GPT-5.5 Pro via Oracle
Add — reviewer: oracle-pro to any reviewer-aware skill (/proof-checker, /research-review, /experiment-audit, /rebuttal, …) to route review through GPT-5.5 Pro, which offers the strongest reasoning for deep proof / code / experiment-design critique. The default stays Codex xhigh; if Oracle is not installed, the skill falls back gracefully with a warning (zero impact). Setup + per-skill examples → reviewer-routing.md
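The fallback behaviour can be pictured as a small routing function. This is a sketch only: `pick_reviewer`, the `oracle_installed` flag, and the `"codex-xhigh"` label are placeholders, and how ARIS actually detects Oracle is not stated here.

```python
import warnings
from typing import Optional

DEFAULT_REVIEWER = "codex-xhigh"  # placeholder label for the default reviewer


def pick_reviewer(requested: Optional[str] = None, oracle_installed: bool = False) -> str:
    """Route to Oracle only when it is requested and available; otherwise warn and fall back."""
    if requested == "oracle-pro":
        if oracle_installed:
            return "oracle-pro"
        warnings.warn("Oracle is not installed; falling back to the default reviewer")
    return DEFAULT_REVIEWER
```

Emitting a warning instead of raising keeps an overnight run alive when the optional backend is missing.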
9. Setup
> New to ARIS? SETUP_GUIDE.md (中文, Chinese) gives a prescriptive 6-step walkthrough for macOS local + remote Linux GPU server with Claude Code + Codex MCP, which is the recommended path. The section below is a quick reference; deeper GPU / customization / model-combo setup lives in the linked docs.
> If you only need Workflow 1 & 2 (idea discovery + auto review), LaTeX is not required.
10.2 Install Skills
> Recommended: project-local flat symlink install (since 2026-04-20). Each ARIS skill is symlinked individually into .claude/skills/, so Claude Code's slash-command discovery picks them up. A manifest at .aris/installed-skills.txt tracks what ARIS installed, so uninstall and reconcile only ever touch managed entries, never your own skills.
>
> Codex mirror route: keep Claude on install_aris.sh / smart_update.sh. For Codex-native project installs, use install_aris_codex.sh; for copied Codex installs, use smart_update_codex.sh.
# 1. Clone ARIS once to a stable location
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.git ~/aris_repo
# 2. For each project that uses ARIS, attach via symlinks:
cd ~/your-paper-project
bash ~/aris_repo/tools/install_aris.sh
# → creates one symlink per skill: .claude/skills/ → ~/aris_repo/skills/
# → writes manifest .aris/installed-skills.txt (tracks every entry ARIS installed)
# → updates managed CLAUDE.md ARIS block (best-effort, compare-and-swap)
# → re-runnable: rerun anytime to reconcile new/removed upstream skills
# 3. To update existing skills' content for ALL attached projects:
cd ~/aris_repo && git pull # symlinks resolve to live upstream, so content updates automatically
# 3a. To pick up newly added or removed upstream skills, rerun the installer:
bash ~/aris_repo/tools/install_aris.sh ~/your-paper-project # adds new symlinks, removes broken ones
# Other useful flags:
bash ~/aris_repo/tools/install_aris.sh --dry-run # show plan, no changes
bash ~/aris_repo/tools/install_aris.sh --uninstall # remove only managed symlinks (per manifest)
bash ~/aris_repo/tools/install_aris.sh --from-old # migrate from old nested .claude/skills/aris/
# Windows (PowerShell, no WSL required; creates flat per-skill junctions):
.\tools\install_aris.ps1 C:\path\to\your-paper-project -Platform claude
.\tools\install_aris.ps1 C:\path\to\your-codex-project -Platform codex
Why "git pull" alone isn't enough for new/removed skills: the flat layout uses one symlink per skill, so upstream additions/deletions don't propagate until the installer is re-run. The trade-off bought us Claude Code's automatic slash-command discovery (which only scans one directory level deep).
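The reconcile behaviour described above (one symlink per skill, a manifest of managed entries, user-owned skills left alone) can be sketched in a few lines of Python. This is not the logic of install_aris.sh, which is a shell script whose internals are not shown here; `install_flat` is a hypothetical helper that only reuses the paths named in this section.

```python
import os
from pathlib import Path
from typing import List


def install_flat(repo_skills: Path, project: Path) -> List[str]:
    """Create one symlink per upstream skill and track managed names in a manifest."""
    skills_dir = project / ".claude" / "skills"
    manifest = project / ".aris" / "installed-skills.txt"
    skills_dir.mkdir(parents=True, exist_ok=True)
    manifest.parent.mkdir(parents=True, exist_ok=True)

    managed = set(manifest.read_text().split()) if manifest.exists() else set()
    upstream = {p.name for p in repo_skills.iterdir() if p.is_dir()}

    for name in sorted(upstream):
        link = skills_dir / name
        if os.path.lexists(link):
            continue  # already linked, or a user-owned skill of the same name: leave it
        link.symlink_to(repo_skills / name, target_is_directory=True)
        managed.add(name)

    # Skills removed upstream: drop only links that this installer created.
    for name in sorted(managed - upstream):
        link = skills_dir / name
        if link.is_symlink():
            link.unlink()
        managed.discard(name)

    manifest.write_text("".join(f"{n}\n" for n in sorted(managed)))
    return sorted(managed)
```

Because each link is created per skill, a plain `git pull` updates the content behind existing links but cannot add or remove entries, which is why the installer has to be re-run.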
Migrating from the old nested install (pre-2026-04-20)
If you previously installed via install_aris.sh (which created .claude/skills/aris/ as a single nested symlink) or via smart_update.sh --target-subdir .claude/skills/aris, your slash commands probably weren't being auto-discovered by Claude Code. Migrate to the flat layout:
# Symlink-style legacy install:
bash ~/aris_repo/tools/install_aris.sh ~/your-project --from-old
# Copy-style legacy install (with possible local edits; choose a strategy explicitly):
bash ~/aris_repo/tools/install_aris.sh ~/your-project --from-old --migrate-copy keep-user
# → keeps your nested .claude/skills/aris/ copy intact alongside the new flat install
bash ~/aris_repo/tools/install_aris.sh ~/your-project --from-old --migrate-copy prefer-upstream
# → archives nested copy to .aris/legacy-copy-backup-/, then flattens
Alternative installs (advanced)
Project-local copy (no symlinks, useful for per-project skill edits):
mkdir -p ~/your-project/.claude/skills
bash ~/aris_repo/tools/smart_update.sh --project ~/your-project --apply
# Default --target-subdir is .claude/skills (flat), which is what Claude Code expects.
# (The old --target-subdir .claude/skills/aris is now deprecated; see the migration block above.)
Global install (one copy in your home dir, available to every project):
> Global install increases the risk of skill name collisions with other globally-installed packs. Use it only if you don't mix ARIS with Superpowers / OpenHands / etc.; otherwise prefer the project-local install above.
> New Claude Code versions may not auto-create ~/.claude/skills/. If using global install, create it first: mkdir -p ~/.claude/skills/. The symlink installer handles directory creation automatically.
Optional: Codex Plugin for Code Review
codex-plugin-cc provides additional Codex capabilities that ARIS auto-detects when installed:
# In Claude Code:
/plugin marketplace add openai/codex-plugin-cc
/plugin install codex@openai-codex
/reload-plugins
/codex:setup
Where ARIS uses the plugin:

| Skill | Workflow | What it does |
|---|---|---|
| /codex:review | Workflow 1.5 | Review experiment code before GPU deployment |
| /codex:adversarial-review | Workflow 1.5 | Adversarial code review (find edge cases, bugs) |
| /codex:rescue | Workflow 1.5 + 3 | Auto-debug rescue: when an experiment or LaTeX compilation fails after 2 attempts, Codex independently diagnoses the root cause before the next retry |

All plugin features are optional: if the plugin is not installed, ARIS falls back to Claude's own diagnosis. The plugin just adds a second pair of eyes.
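The rescue rule (an independent diagnosis from the second failure onward, with a fallback when the plugin is absent) can be sketched as a retry loop. All names below are hypothetical, and the `max_attempts` cap is an assumption; the text above only specifies the "after 2 attempts" trigger.

```python
from typing import Callable, List, Optional, Tuple


def run_with_rescue(
    task: Callable[[int], str],
    self_diagnose: Callable[[Exception], str],
    rescue_diagnose: Optional[Callable[[Exception], str]] = None,
    rescue_after: int = 2,
    max_attempts: int = 4,  # assumed cap, not from the source
) -> Tuple[Optional[str], List[str]]:
    """Retry `task`; from failure number `rescue_after` on, prefer the rescue diagnosis."""
    notes: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), notes
        except Exception as exc:
            # Use the independent diagnosis only if it is available (plugin installed).
            use_rescue = attempt >= rescue_after and rescue_diagnose is not None
            diagnose = rescue_diagnose if use_rescue else self_diagnose
            notes.append(diagnose(exc))
    return None, notes
```

Passing `rescue_diagnose=None` models the "plugin not installed" case, where every diagnosis comes from the primary model.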
> Note: ARIS's core cross-model review (paper scoring, idea evaluation, rebuttal stress test) still uses Codex MCP, which allows custom prompts. The plugin cannot replace this.
10.3 Update Skills
cd Auto-claude-code-research-in-sleep
git pull
# Smart update (recommended): analyzes what's safe to update
bash tools/smart_update.sh # dry-run: shows what would change
bash tools/smart_update.sh --apply # apply: adds new + updates safe ones
# Manual options (if you prefer):
# cp -r skills/* ~/.claude/skills/ # Option A: overwrite all
# cp -rn skills/* ~/.claude/skills/ # Option B: only add new, keep yours
# cp -r skills/experiment-bridge ~/.claude/skills/ # Option C: specific skill
> Smart update compares your local skills with upstream, detects personal customizations (server paths, API keys, etc.), and only updates skills that are safe to replace. Skills with your personal info are flagged for manual review.
10.4 Usage
# Workflow 1: Idea Discovery
> /idea-discovery "your research direction" # full pipeline
> /research-lit "topic" # just literature survey (all sources)
> /research-lit "topic" — sources: zotero, web # mix and match sources
> /research-lit "topic" — sources: deepxiv # DeepXiv-only progressive retrieval
> /research-lit "topic" — sources: exa # Exa AI-powered web search with content extraction
> /research-lit "topic" — arxiv download: true # also download top arXiv PDFs
> /arxiv "discrete diffusion" — download # standalone arXiv search + download
> /idea-creator "topic" # just brainstorm
# Workflow 2: Auto Research Loop
> /auto-review-loop "your paper topic" # review → fix → repeat
> /research-review "your paper" # single deep review
# Workflow 3: Paper Writing
> /paper-writing "NARRATIVE_REPORT.md" # full pipeline
> /paper-plan "NARRATIVE_REPORT.md" # just outline
> /paper-compile "paper/" # just compile
# Full Pipeline
> /research-pipeline "your research direction" # Workflow 1 → 2 → 3 end-to-end
# Supporting Skills
> /run-experiment train.py --lr 1e-4 --epochs 100
> /analyze-results figures/*.json
> /monitor-experiment server5
10.5 Auto-Allow for Overnight Runs (Optional)
Skip permission prompts on overnight runs: add a snippet to .claude/settings.local.json
To run the auto-review loop without clicking permission prompts, add to .claude/settings.local.json:
When the reviewer says "run an ablation", Claude Code writes the script and runs it on your GPU; you just declare your server in CLAUDE.md. Three modes (Remote SSH · Local GPU · Vast.ai on-demand): config snippets + setup → docs/GPU_SETUP.md (Vast.ai deep-dive → Vast.ai guide). No GPU? Review/rewrite skills still work; experiment fixes are flagged for manual follow-up.
Integrations (Optional)
Plug your library / vault / notifications into ARIS. Each integration auto-skips silently if unconfigured:
- Zotero: collections + annotations + BibTeX in /research-lit (before web search).
- Obsidian + arXiv: search your vault notes; arXiv is built-in, no setup.
- Feishu / Lark: mobile push + interactive approve/reject for overnight runs.
10. Customization
Skills are plain Markdown, so fork and tune them. Per-skill environment variables (GPU target, code review, reviewer routing, human checkpoints, paper-writing knobs) and parameter pass-through live in docs/CUSTOMIZATION.md.
11. Alternative Model Combinations
No Claude / OpenAI API? Swap in other providers and keep the same cross-model architecture. ARIS ships 10 alternative routes (Z.ai GLM, Alibaba Kimi/Qwen/MiniMax, free DeepSeek-V3.1 via ModelScope, OpenRouter as a pin-one-of-many reviewer backend, Codex-as-executor with Claude/Gemini reviewers, Google Antigravity). Full routing table + per-route setup in docs/MODEL_COMBINATIONS.md.
12. Community
Domain-specific skills welcome! The core skills cover general research workflows, but every field has its own tools and patterns. We welcome PRs that add new skills for your domain: EDA, bioinformatics, robotics, HPC, or anything else. Just add a skills/your-skill/SKILL.md and open a PR. See dse-loop for an example.
Join the WeChat group for discussion on Claude Code + AI-driven research workflows:
13. Citation
If you use ARIS in your research, please cite:
@article{yang2026aris,
title={ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration},
author={Yang, Ruofeng and Li, Yongcan and Li, Shuai},
journal={arXiv preprint arXiv:2605.03042},
year={2026}
}
Architecture & vision: @JingxuanKang. Beyond code contributions (training-check, result-to-claim, ablation-planner, watchdog, templates, session recovery), he deeply shaped ARIS through discussions on compact mode, workflow state management, and the vision of autonomous research. Many of today's core features (structured project files, context-aware session recovery) grew out of these conversations.