Module deduplication

Deduplication service for the pre-compact hook.

This module provides three-tier deduplication checking:

  1. Exact match: SHA256 hash comparison via tag search
  2. Semantic similarity: FastEmbed embeddings with cosine similarity threshold
  3. Recent capture: In-memory LRU cache with TTL-based expiration

The service evaluates the tiers with short-circuit logic: it returns as soon as one tier reports a match and skips the remaining tiers.
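The short-circuit behaviour described above can be illustrated with a minimal, std-only sketch. It is not the crate's implementation: the `Reason` enum, the `first_match` function, and the three stand-in closures are hypothetical names, and the tier order is assumed to follow the numbered list. A counter makes the early exit visible.

```rust
use std::cell::Cell;

/// Hypothetical label for the tier that matched; not the crate's `DuplicateReason`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Reason {
    ExactMatch,
    SemanticSimilar,
    RecentCapture,
}

/// Runs each check in order and returns the reason of the first one that
/// matches. Later checks are never called once a match is found.
fn first_match(content: &str, tiers: &[(Reason, &dyn Fn(&str) -> bool)]) -> Option<Reason> {
    for (reason, check) in tiers {
        if check(content) {
            return Some(*reason);
        }
    }
    None
}

fn main() {
    // Counts how many tiers actually ran.
    let calls = Cell::new(0);

    // Stand-in checks (assumptions): real tiers would compare hashes,
    // compare embeddings, or consult a cache.
    let exact = |c: &str| {
        calls.set(calls.get() + 1);
        c == "Use PostgreSQL for primary storage"
    };
    let semantic = |c: &str| {
        calls.set(calls.get() + 1);
        c.to_lowercase().contains("postgresql")
    };
    let recent = |_: &str| {
        calls.set(calls.get() + 1);
        false
    };

    let tiers: [(Reason, &dyn Fn(&str) -> bool); 3] = [
        (Reason::ExactMatch, &exact),
        (Reason::SemanticSimilar, &semantic),
        (Reason::RecentCapture, &recent),
    ];

    // The first tier matches, so the other two are skipped.
    let hit = first_match("Use PostgreSQL for primary storage", &tiers);
    assert_eq!(hit, Some(Reason::ExactMatch));
    assert_eq!(calls.get(), 1);

    // The first tier misses and the second matches; the third is skipped.
    calls.set(0);
    let hit = first_match("We chose postgresql", &tiers);
    assert_eq!(hit, Some(Reason::SemanticSimilar));
    assert_eq!(calls.get(), 2);

    // No tier matches, so all three run.
    calls.set(0);
    assert_eq!(first_match("Adopt Rust 2021 edition", &tiers), None);
    assert_eq!(calls.get(), 3);
}
```

Ordering the tiers from cheapest to most expensive is one plausible motivation for the early exit, since a hit in an earlier tier avoids the work of the later ones.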

§Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    DeduplicationService                         │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────┐ │
│  │ ExactMatch   │  │ Semantic     │  │ RecentCapture          │ │
│  │ Checker      │  │ Checker      │  │ Checker                │ │
│  │              │  │              │  │                        │ │
│  │ SHA256 hash  │  │ Embedding    │  │ LRU Cache with TTL     │ │
│  │ comparison   │  │ similarity   │  │ (5 min window)         │ │
│  └──────────────┘  └──────────────┘  └────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
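As a rough illustration of the recent-capture box in the diagram, the sketch below models only the TTL side of such a cache using the standard library. It is an assumption-laden stand-in, not the crate's `RecentCapture` checker: `RecentCaptures`, `record`, and `seen_recently` are hypothetical names, capacity-based LRU eviction is omitted, and the current time is passed in explicitly so that expiry can be exercised without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical TTL-only cache keyed by content hash.
struct RecentCaptures {
    ttl: Duration,
    seen: HashMap<String, Instant>,
}

impl RecentCaptures {
    fn new(ttl: Duration) -> Self {
        Self { ttl, seen: HashMap::new() }
    }

    /// Remembers that `hash` was captured at `now`.
    fn record(&mut self, hash: &str, now: Instant) {
        self.seen.insert(hash.to_owned(), now);
    }

    /// Returns `true` if `hash` was recorded less than `ttl` before `now`.
    /// An expired entry is removed as a side effect.
    fn seen_recently(&mut self, hash: &str, now: Instant) -> bool {
        let fresh = self
            .seen
            .get(hash)
            .is_some_and(|&at| now.duration_since(at) < self.ttl);
        if !fresh {
            self.seen.remove(hash);
        }
        fresh
    }
}

fn main() {
    // Five-minute window, matching the figure in the diagram.
    let ttl = Duration::from_secs(5 * 60);
    let mut recent = RecentCaptures::new(ttl);

    let t0 = Instant::now();
    recent.record("hash-of-content", t0);

    // One minute later the entry is still inside the window.
    assert!(recent.seen_recently("hash-of-content", t0 + Duration::from_secs(60)));
    // At exactly five minutes the entry has expired and is dropped.
    assert!(!recent.seen_recently("hash-of-content", t0 + ttl));
    // A hash that was never recorded is not a recent capture.
    assert!(!recent.seen_recently("other-hash", t0));
}
```

Taking `now` as a parameter rather than calling `Instant::now()` inside the method is a testing convenience in this sketch; a production cache would more likely also bound its size.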

§Example

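use subcog::services::deduplication::{DeduplicationService, DeduplicationConfig};

// `recall`, `embedder`, and `Namespace` are assumed to be in scope;
// this excerpt does not show how they are constructed or imported.
let config = DeduplicationConfig::default();
let service = DeduplicationService::new(recall, embedder, config);

// `?` propagates any error from the check, so this must run inside a
// function that returns a compatible `Result`.
let result = service.check_duplicate("Use PostgreSQL for primary storage", Namespace::Decisions)?;
if result.is_duplicate {
    println!("Skipping duplicate: {:?}", result.reason);
}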

Modules§

config 🔒
Deduplication configuration.
exact_match 🔒
Exact match deduplication checker.
hasher 🔒
Content hashing utility for deduplication.
recent 🔒
Recent capture deduplication checker.
semantic 🔒
Semantic similarity deduplication checker.
service 🔒
Deduplication service orchestrator.
types 🔒
Deduplication result types.

Structs§

ContentHasher
Content hasher for deduplication.
DeduplicationConfig
Configuration for the deduplication service.
DeduplicationService
Service for deduplication checking.
DuplicateCheckResult
Result of a deduplication check.

Enums§

DuplicateReason
The reason content was identified as a duplicate.

Traits§

Deduplicator
Trait for deduplication checking.