fully local · offline · zero false positives

Staleguard

Guard your docs against drift

Catches the places where your CLAUDE.md, READMEs, and *.md docs claim something the code no longer backs up. Checks docs against the actual codebase and reports what's stale, wrong, or missing.

$ brew install Arthur920/tap/staleguard

The architecture

Three layers, escalating only when needed

Each layer is cheaper and higher-signal than the next, so most drift is caught before any model runs. Layer 1 is instant and needs nothing; Layers 2–3 run local ONNX models — code never leaves the machine.

I

Deterministic

Paths exist? Commands real? Config keys present? Architecture rules parsed from prose, checked against the real import graph. No ML — tuned for zero false positives.

instant · no model · ~1.2s on 330k LOC
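
As a rough illustration of the "paths exist?" check, here is a minimal sketch that assumes docs quote paths in backticks. The function name `find_missing_paths` and the regex are hypothetical and are not Staleguard's actual implementation.

```python
import re
import tempfile
from pathlib import Path

# Assumption: a backticked token containing at least one "/" is a path claim.
PATH_CLAIM = re.compile(r"`([\w.-]+(?:/[\w.-]+)+)`")

def find_missing_paths(doc_text: str, repo_root: Path) -> list[str]:
    """Return quoted paths from doc_text that do not exist under repo_root."""
    missing = []
    for match in PATH_CLAIM.finditer(doc_text):
        candidate = match.group(1)
        if not (repo_root / candidate).exists():
            missing.append(candidate)
    return missing

# Build a throwaway repo with one real file and check a doc against it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "src" / "main.rs").write_text("fn main() {}\n")
    doc = "Entry point is `src/main.rs`; helpers live in `src/gone.rs`. Run `cargo build`."
    missing = find_missing_paths(doc, root)
```

Because the check is a plain filesystem lookup, it needs no model and cannot flag a path that actually exists.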
II

Retrieval

For each surviving claim, local embeddings fetch the most-relevant code chunks — symbol-aligned via tree-sitter, with an optional reranker.

local embeddings · jina · on-device
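
The retrieval step can be pictured as a nearest-neighbour search over code chunks. The sketch below swaps the real embedding model for a toy token-count vector so it runs anywhere; `embed`, `cosine`, and `top_k` are illustrative names, not Staleguard functions.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(claim: str, chunks: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k chunks most similar to the claim."""
    claim_vec = embed(claim)
    ranked = sorted(
        chunks,
        key=lambda name: cosine(claim_vec, embed(chunks[name])),
        reverse=True,
    )
    return ranked[:k]

# One chunk per symbol, mirroring the symbol-aligned chunking described above.
chunks = {
    "retry_request": "def retry_request(url, max_retries=3): ...",
    "parse_config": "def parse_config(path): ...",
    "hash_password": "def hash_password(password, salt): ...",
}
best = top_k("retry_request uses max_retries of 5", chunks, k=1)
```

A real embedding model would also match paraphrases that share no tokens with the code, which this toy version cannot do.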
III

Verification

A code-aware NLI cross-encoder judges (evidence, claim) → supported · contradicted · unverifiable, each with a confidence.

UniXcoder fine-tune · ~0.14s / claim
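
One plausible way to turn the cross-encoder's three scores into a verdict is sketched below. The logit order, the `threshold` and `margin` values, and the function name `judge` are all assumptions made for illustration; the real decision rule may differ.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def judge(logits, threshold=0.5, margin=0.1):
    """Map (entailment, contradiction, neutral) logits to (verdict, confidence).

    threshold and margin are hypothetical defaults: a contradiction is only
    reported when it is likely AND clearly out-scores entailment.
    """
    p_support, p_contra, p_neutral = softmax(logits)
    if p_contra >= threshold and p_contra - p_support >= margin:
        return "contradicted", p_contra
    if p_support >= threshold and p_support > p_contra:
        return "supported", p_support
    return "unverifiable", p_neutral

verdict_a = judge([0.1, 3.0, 0.2])   # contradiction dominates
verdict_b = judge([3.0, 0.1, 0.2])   # entailment dominates
verdict_c = judge([1.0, 1.0, 1.0])   # no class is confident
```

Requiring a margin on top of a threshold trades a little recall for fewer false alerts when entailment and contradiction score close together.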

Underneath sits a drift ledger that makes runs incremental, scores alignment, and gates CI on regressions.

What it catches

Six classes of documentation drift

// REFERENCES

Broken references

  • file/dir paths quoted in docs that don't exist
  • commands (npm run, make) with no matching script or target
  • env vars & flags documented but never read
  • qualified code refs that resolve to no symbol
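
For the command check, a minimal sketch might compare `npm run` mentions in a doc against the `scripts` table of `package.json`. The helper name `undefined_npm_scripts` is hypothetical and the regex covers only the simplest spelling of the command.

```python
import json
import re

# Assumption: script names use word characters, ":" and "-".
NPM_RUN = re.compile(r"npm run ([\w:-]+)")

def undefined_npm_scripts(doc_text: str, package_json: str) -> list[str]:
    """Return documented npm scripts that package.json does not define."""
    scripts = json.loads(package_json).get("scripts", {})
    mentioned = NPM_RUN.findall(doc_text)
    return sorted({name for name in mentioned if name not in scripts})

doc = "Run `npm run build`, then `npm run lint:fix` before committing."
pkg = '{"scripts": {"build": "tsc", "test": "vitest"}}'
stale = undefined_npm_scripts(doc, pkg)
```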
// ARCHITECTURE

Architecture violations

  • forbidden imports — "controllers must not import db"
  • layering — "domain depends on nothing"
  • independence — "core is independent of infra"
  • forbidden symbols outside their allowed module
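
A forbidden-import rule such as "controllers must not import db" could be checked roughly as follows. The rule grammar, the dotted module naming, and the graph shape are assumptions for this sketch; it checks direct imports only.

```python
import re

# Hypothetical rule grammar: "<layer> must not import <layer>".
RULE = re.compile(r"(\w+) must not import (\w+)")

def parse_rule(prose: str):
    """Extract (source_layer, forbidden_layer) from a prose rule, or None."""
    match = RULE.search(prose)
    return match.groups() if match else None

def in_layer(module: str, layer: str) -> bool:
    """True if a dotted module name belongs to the given top-level layer."""
    return module == layer or module.startswith(layer + ".")

def violations(import_graph, source_layer, forbidden_layer):
    """List (importer, imported) pairs that break the rule."""
    found = []
    for module, imports in sorted(import_graph.items()):
        if not in_layer(module, source_layer):
            continue
        for target in sorted(imports):
            if in_layer(target, forbidden_layer):
                found.append((module, target))
    return found

# Import graph: module -> set of modules it imports directly.
graph = {
    "controllers.users": {"services.users", "db.session"},
    "services.users": {"db.session"},
    "db.session": set(),
}
rule = parse_rule('"controllers must not import db"')
broken = violations(graph, *rule)
```

Only `controllers.users` is reported: `services.users` also imports `db.session`, but the rule does not constrain the services layer.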
// BEHAVIOR

Behavioral contradictions

  • a local NLI cross-encoder judges prose the rules can't
  • verdicts: supported / contradicted / unverifiable
  • claims ground to symbols — a verdict re-opens when that code changes
// COVERAGE

Coverage gaps

  • public code surface that no doc describes
  • risk-ranked by fan-in, churn, and complexity
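
To make the risk ranking concrete, here is one possible scoring scheme. The equal weighting and max-normalisation are assumptions chosen for simplicity, and the `Symbol` record is hypothetical; the source only states which three signals are used.

```python
from dataclasses import dataclass

@dataclass
class Symbol:
    name: str
    fan_in: int        # how many other symbols reference this one
    churn: int         # how often it changed recently
    complexity: int    # e.g. a cyclomatic complexity estimate
    documented: bool

def rank_gaps(symbols):
    """Return undocumented symbols, highest risk first."""
    def normalised(symbol, attr):
        # Scale each signal by the largest value seen in the repo.
        peak = max(getattr(s, attr) for s in symbols) or 1
        return getattr(symbol, attr) / peak

    def risk(symbol):
        # Assumption: the three signals are weighted equally.
        signals = ("fan_in", "churn", "complexity")
        return sum(normalised(symbol, a) for a in signals) / len(signals)

    undocumented = [s for s in symbols if not s.documented]
    return sorted(undocumented, key=risk, reverse=True)

symbols = [
    Symbol("format_date", fan_in=2, churn=1, complexity=2, documented=False),
    Symbol("load_config", fan_in=12, churn=9, complexity=7, documented=False),
    Symbol("run_server", fan_in=20, churn=10, complexity=10, documented=True),
]
ranked = [s.name for s in rank_gaps(symbols)]
```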
// DIAGRAMS

Diagram coherence

  • Mermaid / PlantUML / Graphviz diffed against the real graph
  • phantom edges, stale boxes, missing arrows
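
Diffing a diagram against the real graph reduces to set arithmetic once edges are extracted. This sketch handles only plain Mermaid `A --> B` edges, one per line; the function names are illustrative.

```python
import re

# Matches simple Mermaid edges such as "api --> service".
EDGE = re.compile(r"(\w+)\s*-->\s*(\w+)")

def diagram_edges(mermaid: str) -> set:
    """Extract (source, target) pairs from a Mermaid flowchart."""
    return set(EDGE.findall(mermaid))

def diff_diagram(mermaid: str, real_edges: set) -> dict:
    """Compare drawn edges with the real dependency edges."""
    drawn = diagram_edges(mermaid)
    return {
        "phantom": sorted(drawn - real_edges),   # drawn, but not in the code
        "missing": sorted(real_edges - drawn),   # in the code, but not drawn
    }

mermaid = """
graph TD
    api --> service
    service --> db
    api --> cache
"""
real = {("api", "service"), ("service", "db"), ("service", "queue")}
report = diff_diagram(mermaid, real)
```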
// DRIFT

Drift over time

  • --diff <ref> re-checks only what changed
  • per-module & repo-wide alignment score
  • CI regression gate; fingerprint staleness
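
Fingerprint staleness and the regression gate can both be sketched in a few lines. The ledger layout (claim id mapped to a hash of the code it was grounded in) and the function names are assumptions; Staleguard's actual ledger format is not described here.

```python
import hashlib

def fingerprint(source: str) -> str:
    """Stable hash of the code a claim was verified against."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

def stale_claims(ledger: dict, current_sources: dict) -> list:
    """Return claim ids whose grounded code changed since the last verdict.

    ledger: claim id -> fingerprint recorded at verdict time (hypothetical).
    current_sources: claim id -> current text of the grounded symbol.
    """
    return sorted(
        claim_id
        for claim_id, recorded in ledger.items()
        if fingerprint(current_sources.get(claim_id, "")) != recorded
    )

def regressed(baseline_score: float, current_score: float) -> bool:
    """True when alignment dropped below the committed baseline."""
    return current_score < baseline_score

ledger = {
    "c1": fingerprint("def a(): return 1"),
    "c2": fingerprint("def b(): return 2"),
}
current = {"c1": "def a(): return 1", "c2": "def b(): return 3"}
reopen = stale_claims(ledger, current)
```

Only claims in `reopen` need to go back through the model layers, which is what keeps incremental runs cheap.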

The Layer 3 judge

A code-aware NLI model, evaluated against real targets

The default judge is Arthur920/staleguard — a microsoft/unixcoder-base fine-tune. Code-aware, so real code stays in-distribution as the premise. The alert class — contradictions — is what we optimize for.

Contradiction precision: 87.6% (low alert fatigue; flags you can trust)
Contradiction recall: 89.9% (catches ~9 in 10 contradictions)

F1 by class, fine-tuned (staleguard):

Class            F1
contradiction    0.887
neutral          0.843
entailment       0.690
macro            0.807

Baseline (roberta-large-mnli), macro F1: 0.386
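
The reported figures are internally consistent, which a short calculation confirms: the contradiction F1 is the harmonic mean of the stated precision and recall, and the macro F1 is the unweighted mean of the three per-class scores.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Contradiction class, from the precision and recall reported above.
contradiction_f1 = f1(0.876, 0.899)

# Macro F1 is the plain average over contradiction, neutral, entailment.
macro_f1 = (0.887 + 0.843 + 0.690) / 3

print(round(contradiction_f1, 3), round(macro_f1, 3))
```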

Get started

Install, run, gate CI, wire into agents

The default build gives you Layer 1 — the deterministic, zero-false-positive core that needs no models. Add the ml feature for Layers 2–3.

Homebrew, an install script, or from source — all give you the deterministic core. Then run check on the full repo.

shell
# Homebrew (macOS / Linux)
$ brew install Arthur920/tap/staleguard

# or from source
$ cargo install --git https://github.com/Arthur920/Staleguard

$ staleguard check                 # full repo, Layer 1

Layers 2–3 run local ONNX models behind the ml feature (prebuilt binaries omit it — the deps are large). Both routes compile from source, then fetch models at runtime.

shell
# Homebrew — compiles with the ml feature
$ brew install Arthur920/tap/staleguard-ml

# or with cargo
$ cargo install --git https://github.com/Arthur920/Staleguard --features ml

$ staleguard setup                 # fetch + load every model, offline thereafter
$ staleguard check --layer 3       # all three layers

staleguard check exits non-zero on any finding or a score regression — a drop-in for any pipeline. Commit a baseline on main, then gate PRs against it.

.github/workflows
# once, on the base branch — records the alignment baseline
$ staleguard check --write-ledger

# in CI on each PR — fail only if alignment regressed
$ staleguard check --fail-on-regression --format json

Staleguard speaks --format json, so any coding agent can run it and read findings back — directly via shell, or exposed as an MCP check_doc_drift tool.

agent
# let the agent call it directly
$ staleguard check --format json --diff main

# a good standing instruction in CLAUDE.md:
#   "After editing code or docs, run staleguard check
#    --format json and fix any reported drift."
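
An agent harness that shells out to the CLI might look like the sketch below. It relies only on behaviour stated above (JSON on stdout, non-zero exit on findings); the helper `run_check` is hypothetical, and the stand-in command's JSON shape is made up because the real output schema is not documented here.

```python
import json
import subprocess
import sys

def run_check(command):
    """Run a drift-check command; return (exit_code, parsed_json_or_None)."""
    proc = subprocess.run(command, capture_output=True, text=True)
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        payload = None  # output was not JSON; let the caller decide
    return proc.returncode, payload

# Real invocation, using only the flags shown above:
#   run_check(["staleguard", "check", "--format", "json", "--diff", "main"])

# Stand-in command so the sketch runs without staleguard installed.
fake = [
    sys.executable,
    "-c",
    "import sys; print('{\"findings\": 1}'); sys.exit(1)",
]
exit_code, payload = run_check(fake)
needs_fixing = exit_code != 0
```

Branching on the exit code rather than on payload fields keeps the harness working even if the JSON schema changes.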

Environment overrides

Variable                     Effect
STALEGUARD_NLI_REPO          NLI judge model repo (default Arthur920/staleguard)
STALEGUARD_NLI_THRESHOLD     decision thresholds: how far contradiction must
STALEGUARD_NLI_MARGIN        out-score entailment
STALEGUARD_NLI_MAX_CLAIMS    per-run claim budget (default 300; 0 = no cap)
STALEGUARD_EMBED_REPO        Layer 2 embedding model
STALEGUARD_RERANK_REPO       optional reranker
STALEGUARD_ORT_THREADS       ONNX intra-op threads (default: all cores)
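
The override semantics can be illustrated with a small reader for two of these variables. The defaults (300 claims, `0` meaning no cap, and the default repo) come from the table above; the function names and the choice to keep the first N claims are assumptions.

```python
import os

def nli_settings(env=None):
    """Read NLI overrides from the environment, applying documented defaults."""
    env = os.environ if env is None else env
    max_claims = int(env.get("STALEGUARD_NLI_MAX_CLAIMS", "300"))
    return {
        "repo": env.get("STALEGUARD_NLI_REPO", "Arthur920/staleguard"),
        # 0 means "no cap", represented here as None.
        "max_claims": None if max_claims == 0 else max_claims,
    }

def apply_budget(claims, max_claims):
    """Assumption: when capped, keep the first max_claims claims."""
    claims = list(claims)
    return claims if max_claims is None else claims[:max_claims]

defaults = nli_settings({})
uncapped = nli_settings({"STALEGUARD_NLI_MAX_CLAIMS": "0"})
```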