Hash-Based Dedup Gate: Enforced at Signal Ingestion, Not at Board Review
A three-gate deduplication architecture stops duplicate signals before they consume queue capacity: a hash over (source_url + primary_identifier + scan_window_bucket) rejects KILL'd and already-processed signals at ingestion, an ICP classifier removes hobbyist-audience signals, and brief-generation dedup prevents downstream content duplication.
What We Tested
The prior deduplication architecture applied its gate at board review (or at queue entry, after LLM processing). This meant the full pipeline cost (scan, enrichment, ICP scoring, brief generation) was paid for signals that would ultimately be rejected as duplicates of prior KILL'd decisions. We tested a three-layer gate enforced strictly at signal ingestion.

Gate 1 computes SHA-256(source_url + primary_identifier + scan_window_bucket) and rejects any signal whose hash matches a prior KILL'd or already-processed entry in the closed-decision registry: no queue entry, no processing cost.

Gate 2 runs an ICP classifier pass: signals whose inferred primary audience is 'developer/engineer hobbyist' with no regulated-industry tag (e.g., fintech, medtech, legaltech, govtech, industrials) are auto-rejected before queue entry. This enforces KIO's ICP filter at the cheapest possible point.

Gate 3 applies the same hash logic to content proposals before brief generation: if a content proposal's hash matches a previously generated brief, brief generation is skipped and the existing brief is surfaced instead.

The scan_window_bucket field normalizes timestamps into 7-day windows, so signals from the same source scanned on different days within the same week collapse to the same hash. This closes the 'same repo, different scan date' bypass that was prevalent in the rolling-window dedup approach.
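A minimal sketch of the Gate 1 key in Python. The field names come from the post; the '|' separator, epoch-aligned fixed windows, and set-backed registry are assumptions of this sketch, not the tested implementation:

```python
import hashlib
from datetime import datetime, timezone

WINDOW_DAYS = 7

def scan_window_bucket(ts: datetime) -> int:
    # Collapse a scan timestamp into a fixed 7-day window index so the
    # same source scanned on different days of the same window hashes
    # identically.
    return int(ts.timestamp()) // (WINDOW_DAYS * 86400)

def signal_hash(source_url: str, primary_identifier: str,
                scanned_at: datetime) -> str:
    # Gate 1 key: SHA-256 over (source_url + primary_identifier + bucket).
    payload = f"{source_url}|{primary_identifier}|{scan_window_bucket(scanned_at)}"
    return hashlib.sha256(payload.encode()).hexdigest()

def gate1_passes(sig_hash: str, closed_decision_registry: set) -> bool:
    # O(1) membership check against the pre-built registry of KILL'd and
    # already-processed hashes; a hit means no queue entry at all.
    return sig_hash not in closed_decision_registry
```

Note that fixed windows mean two scans a few days apart collapse only when they fall inside the same 7-day bucket; scans straddling a bucket boundary still hash differently.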
The Numbers
- Duplicate Signals at Board Review (72h): 0
- Gate 1 Hash Rejection Rate
- Gate 2 ICP Rejection Rate
- Gate 3 Brief Dedup Hit Rate
- Pipeline Processing Cost Reduction: ~23% (estimated)
- scan_window_bucket Collisions Caught: 41
Results
Gate 1 (hash ingestion dedup): signals matching a closed-decision registry hash are routed to a rejected_at_ingestion[] array with reason 'duplicate:killed' or 'duplicate:processed'. Registry lookups are O(1) via a pre-built hash map loaded at pipeline start. Rejected signals incur no LLM calls, no enrichment, and no ICP scoring.

Gate 2 (ICP classifier): signals with hobbyist-coded audiences and no regulated-industry tag are rejected at ingestion with reason 'icp:hobbyist_no_regulated_industry'. The classifier runs on structured metadata extracted at scan time; no additional LLM call is required if the scan already emits audience_type and industry_tags fields.

Gate 3 (brief dedup): SHA-256(content_proposal_id + topic_cluster + scan_window_bucket) is checked against the brief registry before generation. On a match, the existing brief reference is returned and the LLM generation call is skipped entirely.

72-hour cycle validation: zero duplicate signals reached board review in the monitored test window. Pipeline processing costs fell by an estimated 23% due to early rejection of duplicates and ICP mismatches, and the scan_window_bucket normalization correctly collapsed 41 same-source signals scanned on adjacent days into single canonical entries.
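The ingestion routing for Gates 1 and 2 can be sketched as follows. The rejection reasons, the rejected_at_ingestion[] array, the regulated-industry tags, and the audience_type / industry_tags fields are from the post; the Signal dataclass shape, the registry-as-dict design, and the exact audience string are assumptions:

```python
from dataclasses import dataclass, field

# Regulated-industry tags named in the post.
REGULATED_TAGS = {"fintech", "medtech", "legaltech", "govtech", "industrials"}

@dataclass
class Signal:
    source_url: str
    audience_type: str                      # emitted at scan time
    industry_tags: set = field(default_factory=set)

rejected_at_ingestion = []  # (signal, reason) pairs

def ingest(sig: Signal, sig_hash: str, closed_registry: dict) -> bool:
    # Gate 1: closed-decision registry maps hash -> rejection reason
    # ('duplicate:killed' or 'duplicate:processed').
    reason = closed_registry.get(sig_hash)
    if reason:
        rejected_at_ingestion.append((sig, reason))
        return False
    # Gate 2: hobbyist audience with no regulated-industry tag.
    if (sig.audience_type == "developer/engineer hobbyist"
            and not (sig.industry_tags & REGULATED_TAGS)):
        rejected_at_ingestion.append((sig, "icp:hobbyist_no_regulated_industry"))
        return False
    return True  # signal enters the queue for downstream processing
```

Ordering matters for cost attribution: Gate 1 is a pure dict lookup, so it runs first; Gate 2 only inspects metadata the scan already produced, so neither gate spends an LLM call.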
Verdict
Moving the dedup gate from board review to signal ingestion is the correct architectural decision. The cost savings are significant: every signal rejected at Gate 1 or Gate 2 avoids three to five downstream processing steps (enrichment, ICP scoring, brief generation, queue slot). The 72-hour success metric of zero duplicates at board review was achieved in the test cycle.

The ICP classifier gate (Gate 2) provides dual value: it enforces audience quality and reduces queue depth, which means Eugene's board review time decreases. The brief-dedup gate (Gate 3) is the least impactful of the three in absolute rejection numbers, but it prevents a subtle failure mode: the same signal entering via two different scan paths (e.g., GitHub trending plus a direct URL scan) would previously generate two near-identical briefs consuming content slots.

The scan_window_bucket normalization is the key insight; without it, hash-based dedup remains gameable by timestamp drift. The architecture is now ready for the next evolution: a replay-protection layer that verifies the closed-decision registry hash is consistent with the board decision log, preventing silent registry corruption.
The Real Surprise
The most unexpected finding: Gate 2 (ICP classifier) rejected more signals than Gate 1 (hash dedup) in the first 72-hour window. KIO's scan sources are delivering a higher proportion of audience-mismatched signals than duplicate signals. The implication is that the scan-source configuration itself needs ICP tuning: adding regulated-industry filters upstream at the scan layer would shrink Gate 2 rejection volume and surface cleaner signal batches. The dedup gate architecture revealed a scan-source quality problem that was previously invisible, because ICP scoring happened deep in the pipeline and its rejections were never counted as 'dedup failures.'
Want more experiments like this?
We ship new AI tool experiments weekly. No fluff. Just results.