OPEN-SOURCE · DEVTOOLS 2026

VIBE Framework

Open source Claude Code plugin that systematically investigates the failure modes of AI-generated code in production: 22 mechanical constraints, 11 agents in isolated contexts, 14 domain skills. Every component is empirically validated and revocable.

AI-generated code in production has systematic, under-discussed failure modes: sycophancy that becomes technical error, file:line citations fabricated by the model, prototypes that pass the demo and break at the first edge case, loss of user corrections between sessions. These are problems of the model, not the user, and refining prompts does not fix them.

  1. Eight domain skills: security (Heimdall), testing with 8 Playwright personas (Emmet), UI with anti-AI-pattern constraints (Seurat), SEO+GEO (Ghostwriter), CRO with competitor benchmarking (Baptist), programmatic video (Orson), Office+PDF documents (Scribe), meta-skill for creating and auditing skills (Forge)
  2. 22 hook handlers across 9 Claude Code lifecycle events, written as regex and exit codes, in four intervention categories: anti-AI rhetorical drift (rhetoric guard with 87 patterns, oracle gate for file:line claims, side-effect verify, pragmatic priming); security and scope (shell-level blocking of destructive commands and force pushes, cross-project scope-guard, post-edit scans against 31 credential and injection patterns); read discipline (read-discipline, read-before-edit); code quality (lint, complexity watch, ADR surface)
  3. Eleven agents: 4 general-purpose (reviewer, researcher, decomposer for atomic decomposition, pragmatic) and 7 domain-audit agents, running in isolated worktrees with persistent memory in `.claude/agent-memory/`. Context separation acts as an experimental control for self-review bias
  4. Audit orchestrator (`/vibe:audit`): delta analysis between successive audits, regression detection, project-rule proposals when the same issue recurs 3+ times
  5. Per-skill empirical model assignment: Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Haiku 4.5 based on blind A/B benchmarks (Tessl-style 880-eval). Validation repeated every release; rotation when the numbers stop holding up
  6. Shared competitor research protocol across Ghostwriter, Seurat, and Baptist in 5 default languages (EN, ZH, ES, PT, FR), 11 with `--global`, 30-day cache
  7. Components removed in the 5.x cycle after auditing their actual output files (correction-capture, auto-dream, tips-engine, cost-tracker): the audit covered 19 days of data and found a 50% false-positive rate and zero consolidations. Removal is part of the methodology
  8. 309 automated tests covering plugin structure, skills, agents, hooks (security, lint, scan, complexity watch, oracle gate, ADR surface, Grep/Glob enrichment), 31 security patterns, frontmatter, scope-guard, v1 migration
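
The delta analysis described in item 4 can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the `Finding` type, the function names, and the idea of representing each audit as a set of findings are all assumptions made for the example. Only the "3+ recurrences" threshold comes from the text above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical audit finding: an issue identifier plus the file it was reported in."""
    rule: str
    path: str


def audit_delta(previous: set[Finding], current: set[Finding]) -> dict[str, set[Finding]]:
    """Compare two successive audits using plain set arithmetic."""
    return {
        "regressions": current - previous,  # present now, absent in the previous audit
        "fixed": previous - current,        # present before, gone now
        "persisting": current & previous,   # reported in both audits
    }


def rule_proposals(history: list[set[Finding]], threshold: int = 3) -> list[str]:
    """Return issue identifiers that appear in at least `threshold` audits."""
    counts: Counter[str] = Counter()
    for audit in history:
        # Count each issue once per audit, however many files it touches.
        for rule in {finding.rule for finding in audit}:
            counts[rule] += 1
    return sorted(rule for rule, seen in counts.items() if seen >= threshold)
```

Under these assumptions, an issue that shows up in three separate audits (even in different files) becomes a candidate project rule, while an issue seen only twice does not.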

Open source framework released as a Claude Code plugin, MIT, at v5.7.0 since May 2, 2026. It systematically investigates the failure modes of AI-generated code in production (sycophancy, fabricated citations, rhetorical drift, inter-session amnesia, security regressions) through mechanical constraints, audits in isolated contexts, and empirical validation of every component. The revocability of choices is part of the methodology.

Code generation through interaction with a language model, known as vibe coding, is today the fastest-growing form of software writing. Its failure modes are discussed far less than its spread, yet they are systematic: sycophancy that becomes a technical error, file:line citations that don't match the session's actual tool calls, prototypes that pass the demo and break at the first production edge case, loss of user corrections between successive sessions.

VIBE Framework is a systematic investigation into a specific question: which mechanical constraints, interposed between a language model and the code it produces in production, reduce these failure modes in a verifiable way. The underlying hypothesis is that the reduction does not come from refining prompts, which the model can ignore, but from installing gates: regex, exit codes, agents that evaluate output in isolated contexts. The framework is the materialization of that hypothesis: an open source plugin for Claude Code, MIT, at v5.7.0 since May 2, 2026, with public benchmark fixtures (tests/model-validation/) that measure how well it holds up at every release.

Three principles orient the methodology, consistent since v3. Market intelligence over guesswork: competitor research, across 5 default languages (EN, ZH, ES, PT, FR) or 11 with --global, precedes the generation of copy, design, or conversion funnels. Process discipline over knowledge: skills don't add knowledge to the model, which already has it, but enforce measurable reasoning steps (audience modeling, generation of multiple options, anti-AI-pattern detection) before delivery. Mechanical quality gates: 22 regex/exit-code hooks across 9 Claude Code lifecycle events. Validation is continuous: in the 5.x cycle four hooks from previous versions (correction-capture, auto-dream, tips-engine, cost-tracker) were removed after auditing their actual output files, which showed 19 days of data, a 50% false-positive rate, and zero consolidations. The ability to revoke components that don't hold up against data is part of the methodology.
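
A regex/exit-code gate of the kind described above might look roughly like this. It is a sketch under stated assumptions, not the framework's handler: the two patterns are illustrative stand-ins for the real pattern list, `check_command` and `run` are hypothetical names, and the event shape (a JSON object carrying the shell command under `tool_input.command`, with exit code 2 meaning "block") is assumed to follow the Claude Code hook convention.

```python
import json
import re
import sys

# Illustrative patterns only; the framework's actual pattern list is not shown in the text.
DESTRUCTIVE = [
    re.compile(r"\brm\s+-[a-zA-Z]*(?:rf|fr)"),        # recursive forced delete
    re.compile(r"\bgit\s+push\b.*(?:--force\b|\s-f\b)"),  # force push
]

BLOCK = 2  # assumed convention: exit code 2 tells the host to block the tool call
ALLOW = 0


def check_command(command: str) -> int:
    """Return BLOCK if the command matches a destructive pattern, otherwise ALLOW."""
    for pattern in DESTRUCTIVE:
        if pattern.search(command):
            # The reason goes to stderr so the host can surface it to the model.
            print(f"blocked: command matches {pattern.pattern!r}", file=sys.stderr)
            return BLOCK
    return ALLOW


def run(event_json: str) -> int:
    """Parse a hook event (assumed shape) and decide on the shell command it carries."""
    event = json.loads(event_json)
    command = event.get("tool_input", {}).get("command", "")
    return check_command(command)
```

Wired up as a hook script, the entry point would be something like `sys.exit(run(sys.stdin.read()))`. The design point is that the decision is made by a pattern match and a process exit code, neither of which the model can argue with.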

The constraints are specific and measurable. One hook intercepts file:line citations that do not appear in the session's tool calls and blocks them. Another catches write promises ("I'll save X") that are not followed by the actual invocation. A third prevents a session scoped to one project from reading .env files in sibling projects. A fourth, on the third consecutive repetition of the same error, forces replanning instead of another retry. In parallel, eleven agents (four general-purpose, seven domain-audit) operate in isolated worktrees with persistent memory. Context separation functions as an experimental control: a reviewer that has not seen the implementation being written has no self-review bias to confirm. The /vibe:audit orchestrator compares successive audits, identifies regressions, and proposes project rules when the same issue recurs 3+ times. Per-skill model assignment (Opus 4.7 for creative tasks, Sonnet 4.6 for structured execution, Haiku 4.5 for high-volume search) is validated each release through blind A/B benchmarks with a Tessl-style 880-evaluation rubric: an empirical choice, revocable when the numbers stop holding up.
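
The first of these constraints, the check on file:line citations, can be illustrated with a short sketch. Everything here is an assumption for the sake of the example: the citation syntax accepted by the regex, the `read_ranges` structure (a map from file path to the line ranges the session actually read), and the function names. The real hook may track tool calls quite differently.

```python
import re

# Matches citations such as "src/app.py:42". The exact citation syntax is an assumption.
CITATION = re.compile(r"\b([\w./-]+\.\w+):(\d+)\b")

ReadRanges = dict[str, list[tuple[int, int]]]


def unverified_citations(response: str, read_ranges: ReadRanges) -> list[str]:
    """Return file:line citations that fall outside every recorded read of that file."""
    missing = []
    for path, line_text in CITATION.findall(response):
        line = int(line_text)
        ranges = read_ranges.get(path, [])
        # A citation is backed only if some recorded read covered that line.
        if not any(start <= line <= end for start, end in ranges):
            missing.append(f"{path}:{line}")
    return missing


def gate(response: str, read_ranges: ReadRanges) -> int:
    """Return 2 (block) when any citation is unbacked, 0 otherwise."""
    return 2 if unverified_citations(response, read_ranges) else 0
```

For instance, if the session read lines 1 to 50 of `src/app.py`, a response citing `src/app.py:42` passes, while one citing `src/app.py:99` or a file that was never opened is blocked.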