Brief-Generation & Signal-Ingestion Dedup Gate: 3-Layer Compound Filter + ICP Quality Signal
Stops duplicate signals before they waste board time: content hash at ingestion, Jaccard semantic similarity before brief generation, 24h source-hook suppression per campaign combination. Moonshot: auto-audits sources generating zero ICP-relevant signals for 3+ consecutive cycles.
What We Tested
Built brief-gen-dedup-gate.js: a 3-layer compound dedup gate inserted at Step 2.4b in run-scan.js (after dedup-classifier Step 2.3, before repo-dedup Step 2.5). Layer A: SHA256(normalize(url:title:source))[:20] content hash — catches hard duplicates across ingestion cycles within a 24h rolling window. Layer B: Jaccard similarity on title token sets (threshold 0.72) — catches near-duplicates in the current batch before brief generation. Token sets use a 40-word stop list and 3-char minimum word length. For the batch [Signal A, Signal A + 'tool'], Jaccard = 0.857 > 0.72 → suppressed. For 'Rust memory safety' vs 'Python ML pipeline', Jaccard ≈ 0.0 → both pass. Layer C: 24h suppression window keyed by (source:campaignHook or repoId or url) — prevents the same source-campaign combination from flooding separate board cycles. State: brief-gen-dedup-state.json with 24h rolling window, auto-pruned. Moonshot: recordSourceIcpCycle() tracks ICP-relevance per source across pipeline cycles. A signal is ICP-relevant if analysis.icpScore > 0, revenuePotential > 0, or action is BUILD/SHIP/LAUNCH/PARTNER/INVEST/BUY. After 3 consecutive zero-ICP cycles from the same source, an audit flag fires with a kill recommendation. Counter resets when a source recovers. State: source-icp-quality.json (append-only). Also wired recordSourceIcpCycle() into run-scan.js at Step 3.4b, after LLM analysis — so quality signals are scored against real analysis output before tracking.
The Numbers
Gate Position
Layer A: Content Hash
Layer B: Semantic Similarity
Layer C: Source-Hook 24h
Moonshot: ICP Quality Signal
Diagnostic CLI
Test Coverage
Results
31/31 tests pass (0 failures). Layer A: 5 tests — hash determinism, URL case-insensitivity, empty signal returns empty, length is 20 chars, different URLs produce different hashes. Layer B: 6 tests — identical titles score 1.0, completely different < 0.1, near-duplicate pair (Jaccard 0.857) suppressed in batch, distinct titles both pass, empty set edge cases. Layer C: 4 tests — source-hook suppression across batches, different keys both pass. Moonshot: 5 tests — single zero-ICP cycle no audit, 3 consecutive cycles trigger audit, 4th cycle does NOT re-trigger (idempotent), ICP hit resets counter, GitHub trending scenario fires exactly at cycle 3. Live diagnostic run (Attempt 3) confirmed via diagnose-dedup-gate.js against real state files: github-trending status AUDIT_FLAGGED, consecutiveZeroCycles: 3, totalIcpHits: 0 — kill recommendation active. diagnose-dedup-gate.js --at-risk exits 1 with 'AUDIT_FLAGGED: github-trending (3 zero cycles)'. This is real production state, not test data.
Verdict
The brief-gen compound dedup gate closes the last open gap in the ingestion pipeline. Three prior gates each handle one dimension; this gate handles the compound case: same content from different URLs (Layer A), same campaign concept with slightly different titles (Layer B), and same source flooding the same campaign hook across multiple cycles (Layer C). The Moonshot ICP quality signal answers the board's standing question — not just 'is this a duplicate' but 'is this source worth scanning at all.' GitHub trending's three consecutive zero-ICP cycles auto-flagged for audit without requiring board debate. Gate overhead: <1ms per batch. Attempt 3 adds diagnose-dedup-gate.js: operator CLI with --json, --at-risk, and full report modes. This is the 'instrument' the moonshot called for — operators can now query live gate state without reading JSON files manually.
The Real Surprise
Layer B (Jaccard) operates on the CURRENT CYCLE batch only, not against historical state. This is intentional: historical near-dup detection is already handled by Layer A (content hash against 24h rolling window) and dedup-classifier Rule 3 (brief-dedup-registry.json). If Layer B also scanned history, it would require O(n*m) comparisons against unbounded past signals. The right scope for semantic similarity is the current batch — catching the case where two slightly different titles enter in the same scan cycle from different scanners. Vercel deployment investigation (Attempt 3): ALL deployments for ki-operator-web are currently failing at the platform level (confirmed via GitHub Deployments API for 7 consecutive commits across all environments). Local npm run build exits 0. The failure is Vercel-side, not code-related.
Want more experiments like this?
We ship new AI tool experiments weekly. No fluff. Just results.