Content-Hash Dedup Gate: Permanent Signal Registry + Pipeline Pause Trigger
A three-layer governance fix for KIO's research pipeline: a SHA-256 fingerprint registry permanently blocks previously decided signals, Eugene's decisions are written back to the registry to prevent re-entry, and more than 3 duplicate clusters in a single run automatically halt the pipeline.
What We Tested
The existing repo-dedup.js gate used rolling time windows (24h/7d). Once those windows expired, a signal that had already been KILLed or APPROVED could re-enter the pipeline as a 'new proposal.' Eugene was re-adjudicating repos he had already decided on, a governance failure that wasted time and eroded trust.

We built and deployed three gates to close this gap permanently:
(1) A content-hash fingerprint gate. Each signal's fingerprint is SHA-256(normalizedUrl + scanDate + signalType)[:20], checked against a persistent closed-decision registry (closed-decision-registry.json).
(2) A decision write-back mechanism. Eugene's KILL/APPROVE decisions populate the registry immediately and block all future re-entry.
(3) A pipeline pause trigger. If more than 3 duplicate signal clusters are detected in a single run, the pipeline halts and pages Engineering via Telegram before proceeding.

The registry persists across pipeline restarts (unlike seen-repos.json, which is session-scoped) and covers all closed states: KILL, NOISE, DELEGATE, APPROVED, done, cancelled.
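As a minimal sketch of the fingerprint formula above, the following Node.js snippet hashes the three fields and keeps the first 20 characters. The normalizeUrl rules, the hex encoding of the digest, and the signal field names (url, scanDate, signalType) are assumptions for illustration; the write-up does not specify them.

```javascript
const crypto = require('crypto');

// Hypothetical normalizer: the real normalization rules are not given here.
// Drops scheme, query, and fragment, strips trailing slashes and a ".git"
// suffix, and lowercases host and path.
function normalizeUrl(rawUrl) {
  const u = new URL(rawUrl);
  const path = u.pathname.replace(/\/+$/, '').replace(/\.git$/, '');
  return (u.hostname + path).toLowerCase();
}

// SHA-256(normalizedUrl + scanDate + signalType), truncated to 20 characters.
// Hex encoding of the digest is an assumption.
function fingerprint(signal) {
  const input = normalizeUrl(signal.url) + signal.scanDate + signal.signalType;
  return crypto.createHash('sha256').update(input).digest('hex').slice(0, 20);
}

// Usage with made-up signal values.
const fp = fingerprint({
  url: 'https://github.com/example/repo',
  scanDate: '2025-01-01',
  signalType: 'trending',
});
console.log(fp.length); // 20
```

Because scanDate is part of the hashed input, the same repo scanned on a different day yields a different fingerprint under this formula.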
The Numbers
Signal Registry Persistence
Decision Write-Back
Duplicate Cluster Detection
Pipeline Pause Trigger
Test Coverage
Registry Entries (seed)
Results
All three gates are implemented and verified by 18 automated tests (18/18 passing).

Gate 1 (fingerprint check): signals whose fingerprint matches the closed-decision registry are routed to an archived[] array and never reach the board.
Gate 2 (decision write-back): recordDecision(signal, decision) and recordDecisionBatch() are wired into run-scan.js, so Eugene's decisions become permanent blockers within the same run.
Gate 3 (pipeline pause): checkGate() counts duplicate clusters per run. If dupeClusters > 3, the orchestrator sends a Telegram health alert with contentGatePaused: true and halts the cycle before any LLM analysis costs are incurred.

The health report now exposes contentGateArchived, contentGateDupeClusters, and contentGateRegistrySize fields for monitoring. The registry is append-only and survives service restarts: once a decision is recorded, a newer scan of the same fingerprint cannot overwrite it.
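The gate behaviour described above can be sketched roughly as follows. This is not the code in run-scan.js: the function names recordDecisionSketch and checkGateSketch, the in-memory Map standing in for closed-decision-registry.json, and the signal fields fingerprint and clusterKey are all hypothetical. The definition of a "duplicate cluster" (a cluster key already present in the registry, or repeated within the run) is also an assumption.

```javascript
const PAUSE_THRESHOLD = 3; // from the write-up: pause when dupeClusters > 3

// Append-only write-back: the first decision recorded for a fingerprint wins.
// Returns false (and changes nothing) if the fingerprint is already closed.
function recordDecisionSketch(registry, signal, decision) {
  if (registry.has(signal.fingerprint)) return false;
  registry.set(signal.fingerprint, {
    decision,
    clusterKey: signal.clusterKey,
    recordedAt: new Date().toISOString(),
  });
  return true;
}

// Splits a run's signals into passed/archived and counts duplicate clusters.
function checkGateSketch(signals, registry) {
  const knownClusters = new Set([...registry.values()].map((e) => e.clusterKey));
  const seenThisRun = new Set();
  const dupeClusterKeys = new Set();
  const passed = [];
  const archived = [];

  for (const s of signals) {
    // Assumed definition of a duplicate cluster (see lead-in).
    if (knownClusters.has(s.clusterKey) || seenThisRun.has(s.clusterKey)) {
      dupeClusterKeys.add(s.clusterKey);
    }
    seenThisRun.add(s.clusterKey);

    // Gate 1: an exact fingerprint match never reaches the board.
    if (registry.has(s.fingerprint)) archived.push(s);
    else passed.push(s);
  }

  return {
    passed,
    archived,
    contentGateArchived: archived.length,
    contentGateDupeClusters: dupeClusterKeys.size,
    contentGateRegistrySize: registry.size,
    // Gate 3: the caller would alert and halt before any LLM analysis.
    contentGatePaused: dupeClusterKeys.size > PAUSE_THRESHOLD,
  };
}
```

In this sketch the orchestrator would call checkGateSketch once per run, stop and send the alert if contentGatePaused is true, and otherwise forward only the passed signals.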
Verdict
The content-hash gate closes the governance gap that rolling dedup windows left open. Previously decided signals can no longer re-enter the board queue, regardless of how much time has passed. The pipeline pause trigger protects Engineering from silent, runaway duplicate storms. Eugene's approval decisions are now the authoritative seed data for the registry; backfilling the last 30 days of KILL decisions is the next manual step needed to populate it fully. The moonshot layer (a public Research Integrity Dashboard showing dedup rate, signal freshness, decision velocity, and registry size) is scoped and ready, but requires 2 weeks of clean gate operation before public exposure.
The Real Surprise
The most impactful discovery came from the cluster detection logic, which revealed that GitHub's trending algorithm recycles the same repos across consecutive days, each time with a different scan timestamp. Because the fingerprint includes the scan date, these repeats produce different fingerprints and would pass the date-scoped fingerprint gate on their own. The cluster key (URL + type, without the date) is what catches 'same repo, different scan day' patterns, and it is what triggers the pipeline pause when a noisy batch arrives. For day-to-day operational health, the cluster gate is more valuable than the fingerprint gate.
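To illustrate why the cluster key catches what the fingerprint misses, here is a small sketch under stated assumptions: the signal fields are already normalized, the digest is hex-encoded, and the cluster key is a plain "url|type" string. The write-up only says the key is URL + type without the date, so the exact format and the helper names below are hypothetical.

```javascript
const crypto = require('crypto');

// Date-scoped fingerprint as described in the write-up (hex encoding assumed).
function fingerprintOf(s) {
  return crypto
    .createHash('sha256')
    .update(s.normalizedUrl + s.scanDate + s.signalType)
    .digest('hex')
    .slice(0, 20);
}

// Cluster key: URL + type, no date. The "|" separator is an assumption.
function clusterKeyOf(s) {
  return `${s.normalizedUrl}|${s.signalType}`;
}

// Returns cluster keys that appear under more than one distinct fingerprint,
// i.e. the same repo and signal type seen on different scan days.
function findRecycledClusters(signals) {
  const clusters = new Map();
  for (const s of signals) {
    const key = clusterKeyOf(s);
    if (!clusters.has(key)) clusters.set(key, new Set());
    clusters.get(key).add(fingerprintOf(s));
  }
  return [...clusters].filter(([, fps]) => fps.size > 1).map(([key]) => key);
}

// Same repo on two consecutive scan days (made-up values).
const day1 = { normalizedUrl: 'github.com/example/repo', scanDate: '2025-01-01', signalType: 'trending' };
const day2 = { normalizedUrl: 'github.com/example/repo', scanDate: '2025-01-02', signalType: 'trending' };

console.log(fingerprintOf(day1) === fingerprintOf(day2)); // false
console.log(clusterKeyOf(day1) === clusterKeyOf(day2));   // true
```

The two scans slip past a fingerprint-only check because their hashes differ, while grouping by cluster key puts them in the same bucket.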