AI-Tool-hub
Winner
Infrastructure

Governance failure: previously KILL'd signals were re-entering the board queue after rolling dedup windows expired, so Eugene was reviewing the same repos multiple times without knowing.

Content-Hash Dedup Gate: Permanent Signal Registry + Pipeline Pause Trigger

Three-layer governance fix for KIO's research pipeline: SHA-256 fingerprint registry blocks previously-decided signals permanently, Eugene's decisions write back to prevent re-entry, and >3 duplicate clusters auto-halt the pipeline.

Published Mar 24, 2026
1. What We Tested

The existing repo-dedup.js gate used rolling time windows (24h/7d). Once those windows expired, a KILL'd or APPROVED signal could re-enter the pipeline as a 'new proposal.' Eugene was re-adjudicating repos he had already decided on: a governance failure that wasted his time and eroded trust in the board. We built and deployed three gates to close this permanently:

(1) A content-hash fingerprint gate using SHA-256(normalizedUrl + scanDate + signalType)[:20], checked against a persistent closed-decision registry (closed-decision-registry.json).
(2) A decision write-back mechanism, so Eugene's KILL/APPROVE decisions populate the registry immediately and block all future re-entry.
(3) A pipeline pause trigger: if more than 3 duplicate signal clusters are detected in a single run, the pipeline halts and pages Engineering via Telegram before proceeding.

The registry persists across pipeline restarts (unlike seen-repos.json, which is session-scoped) and covers all closed states: KILL, NOISE, DELEGATE, APPROVED, done, cancelled.
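The fingerprint formula in gate (1) can be sketched in a few lines of Node.js. This is a minimal illustration, not the code in repo-dedup.js: the normalizeUrl helper and the signal field names (url, scanDate, signalType) are hypothetical, the inputs are concatenated without a separator because the write-up does not mention one, and [:20] is read as "first 20 hex characters of the digest".

```javascript
const crypto = require("node:crypto");

// Hypothetical normalizer (an assumption, not the pipeline's actual rules):
// drops protocol, query string, fragment, trailing slashes and a ".git"
// suffix, then lowercases the result.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  const path = url.pathname.replace(/\/+$/, "").replace(/\.git$/, "");
  return `${url.hostname}${path}`.toLowerCase();
}

// SHA-256(normalizedUrl + scanDate + signalType), truncated to 20 hex chars.
function fingerprint(signal) {
  const input = normalizeUrl(signal.url) + signal.scanDate + signal.signalType;
  return crypto.createHash("sha256").update(input).digest("hex").slice(0, 20);
}

// Two spellings of the same repo on the same scan date share one fingerprint.
const a = fingerprint({
  url: "https://GitHub.com/Foo/Bar.git/",
  scanDate: "2026-03-24",
  signalType: "trending",
});
const b = fingerprint({
  url: "https://github.com/foo/bar?tab=readme",
  scanDate: "2026-03-24",
  signalType: "trending",
});
console.log(a === b);
```

Because scanDate is part of the hash input, the same repo scanned on a different day gets a different fingerprint under this formula.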

2. The Numbers

Signal Registry Persistence (scope)
Before: Rolling 24h/7d window (expires)
After: Permanent closed-decision registry (never expires)

Decision Write-Back (gate)
Before: None. Eugene's decisions were not persisted to the gate
After: recordDecision() wired into run-scan.js

Duplicate Cluster Detection (gate)
Before: None. Same repo, different scan date = new signal
After: SHA-256 cluster key (URL+type) detects same-repo recurrence

Pipeline Pause Trigger (threshold)
Before: None. Noisy runs consumed the full LLM budget
After: >3 duplicate clusters → halt + Telegram Engineering page

Test Coverage (tests)
Before: 0 tests for dedup persistence
After: 18/18 tests passing (fingerprint, write-back, cluster halt, stats)

Registry Entries, seed (decisions)
Before: 0. No closed-decision registry existed
After: Auto-initialized; backfill of 30-day KILL history in progress
3. Results

All three gates are implemented and verified with 18 automated tests (18/18 passing).

Gate 1 (fingerprint check): signals matching the closed-decision registry are routed to an archived[] array and never reach the board.

Gate 2 (decision write-back): recordDecision(signal, decision) and recordDecisionBatch() are wired into run-scan.js, so Eugene's decisions become permanent blockers within the same run.

Gate 3 (pipeline pause): checkGate() counts duplicate clusters per run; if dupeClusters > 3, the orchestrator sends a Telegram health alert with contentGatePaused: true and halts the cycle before any LLM analysis costs are incurred.

The health report now exposes contentGateArchived, contentGateDupeClusters, and contentGateRegistrySize fields for monitoring. The registry is append-only and survives service restarts: once a decision is recorded, it cannot be overwritten by a newer scan of the same fingerprint.
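The interaction of the three gates can be sketched as below. This is an illustration under stated assumptions rather than the deployed code: the registry is a hypothetical in-memory structure instead of closed-decision-registry.json, recordDecision and checkGate take the registry as an explicit first argument (the write-up gives recordDecision(signal, decision)), each signal is assumed to already carry fingerprint and clusterKey fields, and a "duplicate cluster" is assumed to mean a cluster key that already exists in the registry. Whether cluster matches are also archived is not stated in the write-up, so this sketch only counts them.

```javascript
// From the write-up: more than 3 duplicate clusters halts the run.
const PAUSE_THRESHOLD = 3;

// Hypothetical in-memory stand-in for closed-decision-registry.json.
function createRegistry() {
  return { byFingerprint: new Map(), clusterKeys: new Set() };
}

// Append-only write-back: the first decision recorded for a fingerprint
// wins, and later attempts to record the same fingerprint are ignored.
function recordDecision(registry, signal, decision) {
  if (registry.byFingerprint.has(signal.fingerprint)) return false;
  registry.byFingerprint.set(signal.fingerprint, {
    decision,
    clusterKey: signal.clusterKey,
  });
  registry.clusterKeys.add(signal.clusterKey);
  return true;
}

// Splits one run into passed and archived signals, counts duplicate
// clusters, and reports whether the run should pause.
function checkGate(registry, signals) {
  const passed = [];
  const archived = [];
  const dupeClusterKeys = new Set();

  for (const signal of signals) {
    // Assumed definition: the cluster was already decided on in the past.
    if (registry.clusterKeys.has(signal.clusterKey)) {
      dupeClusterKeys.add(signal.clusterKey);
    }
    // Exact fingerprint match: archive instead of sending to the board.
    if (registry.byFingerprint.has(signal.fingerprint)) {
      archived.push(signal);
    } else {
      passed.push(signal);
    }
  }

  return {
    passed,
    archived,
    contentGateArchived: archived.length,
    contentGateDupeClusters: dupeClusterKeys.size,
    contentGateRegistrySize: registry.byFingerprint.size,
    contentGatePaused: dupeClusterKeys.size > PAUSE_THRESHOLD,
  };
}
```

In this sketch a caller would check contentGatePaused before starting any LLM analysis and send the alert instead of proceeding when it is true. The Telegram call itself is left out because the write-up does not describe it.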

Verdict

The content-hash gate closes the governance gap that rolling dedup windows left open. Previously-decided signals can no longer re-enter the board queue regardless of how much time has passed. The pipeline pause trigger protects Engineering from silent runaway duplicate storms. Eugene's decisions are now the authoritative seed data for the registry; backfilling the last 30 days of KILL decisions is the next manual step to fully populate it. The moonshot layer (a public Research Integrity Dashboard showing dedup rate, signal freshness, decision velocity, and registry size) is scoped and ready but requires 2 weeks of clean gate operation before public exposure.

The Real Surprise

The most impactful discovery: the cluster detection logic revealed that GitHub's trending algorithm recycles the same repos across consecutive days with different scan timestamps. Because scanDate is part of the fingerprint, each of these scans produces a different fingerprint and passes the date-scoped gate. The cluster key (URL+type without date) is what catches 'same repo, different scan day' patterns, and it is what triggers the pipeline pause when a noisy batch arrives. For day-to-day operational health, the cluster gate is more valuable than the fingerprint gate.
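The difference between the two keys is easy to see side by side. The sketch below assumes the signal already carries a normalizedUrl field and that the cluster key is truncated to 20 hex characters like the fingerprint; the write-up only states that the cluster key is URL+type without the date, so the truncation and the field names are assumptions.

```javascript
const crypto = require("node:crypto");

function sha256Hex(text) {
  return crypto.createHash("sha256").update(text).digest("hex");
}

// Date-scoped fingerprint: URL + scan date + signal type.
function fingerprint(signal) {
  return sha256Hex(
    signal.normalizedUrl + signal.scanDate + signal.signalType
  ).slice(0, 20);
}

// Cluster key: same inputs minus the scan date (truncation is an assumption).
function clusterKey(signal) {
  return sha256Hex(signal.normalizedUrl + signal.signalType).slice(0, 20);
}

// The same trending repo picked up on two consecutive days.
const dayOne = {
  normalizedUrl: "github.com/foo/bar",
  scanDate: "2026-03-23",
  signalType: "trending",
};
const dayTwo = { ...dayOne, scanDate: "2026-03-24" };

console.log(fingerprint(dayOne) === fingerprint(dayTwo)); // false
console.log(clusterKey(dayOne) === clusterKey(dayTwo)); // true
```

The fingerprints differ because the scan date changed, while the cluster keys match because the date never enters the hash.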
