Module config

Deduplication configuration.

This module defines configuration for the deduplication service, including per-namespace similarity thresholds and cache settings.

§Threshold Rationale

Per-namespace thresholds balance precision (avoiding false duplicates) against recall (catching true duplicates). Each default is tuned to the kind of content its namespace holds:

§Decisions (0.92 - High Threshold)

Architectural decisions are high-value captures where even slightly different phrasings may represent distinct rationale. A false positive (marking a unique decision as duplicate) is worse than a false negative (allowing similar decisions).

Example: “Use PostgreSQL for persistence” vs “Use PostgreSQL for ACID guarantees” are semantically similar (~91%) but capture different reasoning. At a 0.92 threshold the pair falls just short of a match, so both decisions are kept.
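
A minimal sketch of how a threshold might be applied to this example. The function and constant names are hypothetical, and the inclusive comparison (a score at or above the threshold counts as a duplicate) is an assumption rather than the documented behavior of the service.

```rust
/// Threshold for the decisions namespace, taken from the heading above.
/// The constant name is hypothetical.
const DECISIONS_THRESHOLD: f32 = 0.92;

/// Returns true when a candidate would be treated as a duplicate.
/// Assumption: similarity is a score in [0, 1], and a score at or
/// above the threshold counts as a duplicate.
fn is_duplicate(similarity: f32, threshold: f32) -> bool {
    similarity >= threshold
}
```

Under these assumptions, `is_duplicate(0.91, DECISIONS_THRESHOLD)` is false, so the two PostgreSQL decisions are both stored.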

§Patterns (0.90 - Standard Threshold)

Code patterns have moderate variation. Similar patterns often represent the same concept, but edge cases exist where context differs meaningfully.

§Learnings (0.88 - Lower Threshold)

Learnings are frequently paraphrased differently when rediscovered. A lower threshold catches these reformulations while still allowing genuinely distinct learnings.

Example: “TIL: Rust closures capture by reference by default” vs “Learned that closures in Rust borrow by default” are the same learning (~87% similar).

§Other Namespaces (0.90 - Default)

Unconfigured namespaces fall back to 0.90, a balanced default that sits between the stricter decisions threshold and the looser learnings threshold.
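
The per-namespace defaults above can be summarized as a simple lookup with a fallback. This is only an illustrative sketch: the function name, the constant, and the use of lowercase string keys for namespaces are assumptions, not the actual `DeduplicationConfig` API.

```rust
/// Fallback threshold for namespaces without an explicit entry.
const DEFAULT_THRESHOLD: f32 = 0.90;

/// Hypothetical lookup of the similarity threshold for a namespace.
/// Assumption: namespaces are identified by lowercase string names.
fn threshold_for(namespace: &str) -> f32 {
    match namespace {
        "decisions" => 0.92, // high: avoid merging distinct rationale
        "patterns" => 0.90,  // standard
        "learnings" => 0.88, // lower: catch paraphrased rediscoveries
        _ => DEFAULT_THRESHOLD,
    }
}
```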

§Tuning Guidelines

| Symptom | Adjustment |
|---|---|
| Unique content wrongly skipped as duplicate | Raise threshold (e.g., 0.95) |
| Duplicate content still captured | Lower threshold (e.g., 0.85) |
| Short content triggers false positives | Increase min_semantic_length |
| Same content captured repeatedly in session | Extend recent_window |
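
To make the last two knobs concrete, here is a hedged sketch of how `min_semantic_length` and `recent_window` might gate the checks. Only the two field names come from this page. The struct, its field types, the character-count unit, and both methods are assumptions made for illustration and do not describe the real `DeduplicationConfig`.

```rust
use std::time::Duration;

/// Illustrative stand-in for two of the tuning knobs named above.
/// Not the real `DeduplicationConfig`; field types are assumed.
#[derive(Debug, Clone)]
struct TuningSketch {
    /// Assumed unit: number of characters.
    min_semantic_length: usize,
    /// Assumed type: a time span covering recent captures.
    recent_window: Duration,
}

impl TuningSketch {
    /// Assumption: content shorter than `min_semantic_length` skips the
    /// semantic similarity check, since short text compares unreliably.
    fn uses_semantic_check(&self, content: &str) -> bool {
        content.chars().count() >= self.min_semantic_length
    }

    /// Assumption: a capture whose previous occurrence is no older than
    /// `recent_window` is treated as a repeat within the session.
    fn is_recent(&self, age: Duration) -> bool {
        age <= self.recent_window
    }
}
```

In this sketch, raising `min_semantic_length` excludes more short content from semantic comparison, and extending `recent_window` makes older captures count as repeats.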

§Structs

DeduplicationConfig
Configuration for the deduplication service.