AI-Tool-hub
Infrastructure · Internal P0 fix: pipeline was surfacing duplicate repos to the board, degrading signal quality and trust

P0 Pipeline Fix: How We Stopped Duplicate Signals from Reaching the Board

4 surgical fixes to the KIO signal pipeline: 6-hour dedup gate, domain blocklist, batch kill, and content output dedup — all deployed, all tested, 39/39 passing.

Published Mar 24, 2026
1. What We Tested

The KIO signal pipeline was surfacing the same repositories to the board across consecutive scan cycles. Eugene (Founder) was receiving duplicate INVESTIGATE/BUY/KILL prompts for repos he had already adjudicated hours earlier. We identified 4 root causes and deployed surgical fixes:

1. The dedup window only covered 24h rolling state, missing same-session re-ingestion within 6h.
2. Permanently-killed domains like gofr-dev/* had no persistent blocklist and could re-enter the pipeline.
3. Batches with >50% stale signals were still surfaced to the board instead of being silently killed.
4. LLM content analysis outputs were not deduplicated, so two sources analyzing the same repo could produce near-identical opportunity writeups that both reached Eugene.
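Root cause (1) comes down to a conditional that was too narrow. A minimal sketch of the gap, assuming a plausible seen-state record shape (the post only names filterSeenRepos(); the helper names here are hypothetical):

```javascript
const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

// Old behavior: only blocked entries that were both stale and repeated
// (seenCount >= 2), so a single re-submission within 6h slipped through.
function isBlockedOld(record, now) {
  return record.seenCount >= 2 && now - record.lastSeen < SIX_HOURS_MS;
}

// Fixed behavior: anything seen in the last 6 hours is blocked unconditionally.
function isBlockedNew(record, now) {
  return now - record.lastSeen < SIX_HOURS_MS;
}

const now = Date.now();
const resubmitted = { repo: "github.com/example/repo", seenCount: 1, lastSeen: now - 60_000 };

console.log(isBlockedOld(resubmitted, now)); // false: the duplicate leaked to the board
console.log(isBlockedNew(resubmitted, now)); // true: caught by the unconditional 6h gate
```

The 24h rolling state is still consulted after this check; the 6h gate simply runs first and needs no repeat count to trigger.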

2. The Numbers

Test Suite (tests passing)
Before: 8 tests (regression only) · After: 39 tests (8 regression + 4 P0 + expansion)

Session Dedup Window (hours)
Before: 24h rolling only · After: 6h unconditional + 24h rolling

Domain Blocklist (permanent blocks)
Before: none — permanent kills re-entered pipeline · After: prefix-matched JSON blocklist (gofr-dev, etc.)

Batch Kill Gate (batch threshold)
Before: none — noisy batches surfaced to board · After: >50% 6h dupes → batch discarded silently

Content Output Dedup (similarity threshold)
Before: none — duplicate LLM outputs reached Eugene · After: Jaccard > 0.7 on type+whatToBuild → merged

Pipeline Trust Score (board confidence)
Before: degraded (duplicates reaching board) · After: restored — all P0 issues fixed
3. Results

All 4 fixes deployed and verified. 39/39 automated tests pass, including 4 new P0-specific regression tests.

Fix 1 (6h session gate): any repo seen in the last 6 hours is now blocked unconditionally in Pass 1 of filterSeenRepos(). Previously this only triggered for stale+repeat entries (seenCount >= 2), allowing single re-submissions through.

Fix 2 (domain blocklist): domain-blocklist.json with prefix matching blocks github.com/gofr-dev/*, gofr-dev, and gofr.dev permanently, before any other filter runs.

Fix 3 (batch kill): if more than 50% of an incoming batch are 6h session duplicates and the batch size is >= 10, the entire batch is discarded silently rather than surfaced to the board. run-scan.js short-circuits analysis and reporting, sending only a health report.

Fix 4 (content output dedup): dedupOutputs() in analyzer.js applies Jaccard word-set similarity (threshold 0.7) to the type+whatToBuild fields after LLM analysis. Near-duplicate opportunity writeups are merged, and the highest-scored one survives.
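Fixes 2 and 3 can be sketched in a few lines. This is a minimal illustration, not the deployed code: the blocklist entries mirror what the post names for domain-blocklist.json, but the function names and the signal/batch shapes are assumptions.

```javascript
// Hypothetical in-memory mirror of domain-blocklist.json.
const BLOCKLIST = ["github.com/gofr-dev/", "gofr-dev", "gofr.dev"];

// Fix 2: prefix match, run before any other filter.
function isDomainBlocked(url) {
  return BLOCKLIST.some((prefix) => url.startsWith(prefix));
}

// Fix 3: discard the whole batch when >50% are 6h session duplicates
// AND the batch is large enough to be meaningful (size >= 10).
function shouldKillBatch(batch) {
  if (batch.length < 10) return false;
  const dupes = batch.filter((signal) => signal.isSessionDupe).length;
  return dupes / batch.length > 0.5;
}

console.log(isDomainBlocked("github.com/gofr-dev/gofr")); // true
console.log(isDomainBlocked("github.com/vercel/next.js")); // false

// 6 of 10 signals are 6h dupes: 0.6 > 0.5, so the batch dies silently.
const noisyBatch = Array.from({ length: 10 }, (_, i) => ({ isSessionDupe: i < 6 }));
console.log(shouldKillBatch(noisyBatch)); // true
```

The size >= 10 guard keeps tiny batches from being killed by a couple of dupes; on a killed batch, run-scan.js would skip LLM analysis entirely and emit only a health report.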

Verdict

Pipeline is clean. All 4 P0 fixes are live on the researcher service. Eugene now only receives net-new, non-duplicate signals that have not been seen in the last 6h, are not from permanently-killed domains, and represent a batch where at least 50% of signals are fresh. The moonshot layer (self-healing pipeline that learns from adjudication history) is scoped and planned but not yet built — 6 months of board decision data would allow auto-weighting repos by KILL/INVESTIGATE/BUY history, creating a proprietary signal filter.
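The content-output dedup (Fix 4) relies on Jaccard word-set similarity. A minimal sketch, assuming a simple whitespace tokenization and a per-output score field; the real dedupOutputs() in analyzer.js is not shown in the post:

```javascript
// Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  const intersection = [...setA].filter((w) => setB.has(w)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : intersection / union;
}

// Merge near-duplicates on type + whatToBuild; the highest-scored one survives
// because outputs are visited in descending score order.
function dedupOutputs(outputs, threshold = 0.7) {
  const kept = [];
  for (const out of [...outputs].sort((x, y) => y.score - x.score)) {
    const key = `${out.type} ${out.whatToBuild}`;
    const isDupe = kept.some((k) => jaccard(key, `${k.type} ${k.whatToBuild}`) > threshold);
    if (!isDupe) kept.push(out);
  }
  return kept;
}
```

With this shape, two sources producing near-identical writeups for the same repo collapse to the single highest-scored entry, while genuinely different opportunities pass through untouched.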

The Real Surprise

The batch kill fix revealed something unexpected: on noisy days (re-trending repos, HackerNews recycling old posts), more than 50% of a batch can be 6h dupes. Without the batch kill gate, these batches were consuming full LLM analysis budget and filling Eugene's board with noise. The gate is now the single most impactful fix for day-to-day pipeline health.

Want more experiments like this?

We ship new AI tool experiments weekly. No fluff. Just results.