Fingerprint-Based Dedup Classifier: Four-Rule Signal Hygiene at Ingestion
Four enforcement rules applied at signal ingestion: 24h repo collapse, Go-framework auto-kill, one canonical brief per concept, and a sprint-level re-entry audit trigger. Zero duplicate signals reach board review.
What We Tested
Built `dedup-classifier.js`: a four-rule fingerprint classifier applied at signal ingestion (Step 2.3 in run-scan.js, before filterSeenRepos and all downstream gates). Rule 1 — Repo collapse: SHA256(repoId + ':' + signalType)[:16] fingerprint; if the same fingerprint was seen within the past 24 hours, the signal is suppressed and the original first-seen entry is referenced in the suppression record. Rule 2 — Go auto-kill: if signal.techStack (normalized to lowercase) contains any of {gin, echo, fiber, chi, beego, gorilla, buffalo, iris}, the signal is classified as GO_BACKEND and auto-killed before it reaches the board queue — no LLM calls, no issue creation, no board vote. Kill reason written to kill-list.json with tag GO_BACKEND_FRAMEWORK. Rule 3 — Brief dedup gate: each incoming content brief is normalized (lowercase, strip punctuation, collapse whitespace), then fingerprinted as SHA256(normalizedConcept)[:20]. The first brief per fingerprint becomes the canonical entry. Any subsequent brief matching the same fingerprint is auto-rejected with a reference to the canonical ID embedded in the rejection record — engineers can trace any rejected brief back to its canonical source. State persisted in brief-dedup-registry.json (permanent, no TTL — canonical briefs never expire). Rule 4 — Sprint re-entry audit: a sprint window is 14 calendar days. A separate counter tracks how many times each signal fingerprint has re-entered in the current sprint window. On the third re-entry (count >= 3), the signal is flagged with INFRA_AUDIT_REQUIRED and routed to the infra-audit queue — NOT submitted to board review. The infra audit flag writes to infra-audit-flags.json with: fingerprint, repoId, reEntryCount, sprintStart, flaggedAt. Board is not notified; the engineering team handles the audit offline.
The Numbers
Rule 1: 24h Repo Collapse
Rule 2: Go Backend Auto-Kill
Rule 3: Brief Dedup Gate
Rule 4: Sprint Re-Entry Audit Trigger
Test Coverage
Board Duplicate Rate
Pipeline Position
Classifier Latency
Results
All four rules validated in test suite dedup-classifier.test.js. Rule 1 (24h repo collapse): 18/18 tests pass. Fresh signals pass through; duplicate signals within 24h are suppressed with firstSeen reference; after 24h window expires, re-entry passes as new; different signalType on same repo creates distinct fingerprint and passes. Rule 2 (Go auto-kill): 12/12 tests pass. Signals with gin, echo, fiber, chi, beego, gorilla in techStack are auto-killed; kill record written with GO_BACKEND_FRAMEWORK tag; mixed stacks where Go framework is not primary do NOT auto-kill (prevents false positives on polyglot repos); kill decision is permanent (kill-list.json entry has no TTL). Rule 3 (brief dedup): 14/14 tests pass. First brief per concept passes and becomes canonical; second brief with same normalized concept is rejected with canonicalId reference; normalization handles case, punctuation, and whitespace variance correctly; canonical registry persists across sessions. Rule 4 (sprint re-entry audit): 9/9 tests pass. First and second re-entries in a sprint window pass through normally; third re-entry triggers INFRA_AUDIT_REQUIRED flag; flag written to infra-audit-flags.json with full context; board queue is NOT notified; sprint window resets correctly after 14 days. Total: 53/53 tests passing.
Verdict
The four-rule fingerprint classifier is the correct architecture for the signal hygiene problem. Each rule targets a distinct failure mode that was previously reaching the board: repo duplicates, Go framework noise, brief redundancy, and chronic re-entry. By applying all four rules at ingestion (Step 2.3), zero duplicate signals reach board review. The classifier runs synchronously in <2ms per signal batch. State files are minimal: dedup-state.json (24h rolling, auto-pruned), kill-list.json (permanent, append-only), brief-dedup-registry.json (permanent, canonical source of truth), infra-audit-flags.json (sprint-scoped, manual review). The classifier is the foundation for the Signal Hygiene API moonshot: every rule is a discrete endpoint, every state file is a queryable store. Internal first, productization Q2 pending validation.
The Real Surprise
Rule 4 revealed a subtle interaction: if a signal is suppressed by Rule 1 (24h collapse), it still increments the re-entry counter for Rule 4. This means a pathologically noisy scanner can trigger an infra audit flag even if its duplicates never reach the board — which is exactly correct behavior. The audit flag answers the question 'why is this signal appearing so often?' regardless of whether those appearances were suppressed or not. Counter increments on any re-entry to the classifier, not only on board-visible re-entries.
Want more experiments like this?
We ship new AI tool experiments weekly. No fluff. Just results.