
ADR-009: Reduced Default Chunk Size

Accepted

The default chunk size affects:

  • Search precision: Smaller chunks = more precise retrieval
  • Context efficiency: Smaller chunks = less wasted context
  • Embedding quality: Embedding models have optimal input sizes
  • Chunking overhead: More chunks = more storage and processing
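To make these trade-offs concrete, here is a minimal sketch of fixed-size chunking by byte count. It is not the project's actual chunker: the function name `chunk_by_bytes` is hypothetical, and a real implementation would likely also respect paragraph or sentence boundaries. It only shows how a byte limit determines how a text is cut.

```rust
/// Split `text` into chunks of at most `max_bytes` bytes each,
/// never cutting inside a multi-byte UTF-8 character.
/// Hypothetical helper for illustration only.
fn chunk_by_bytes(text: &str, max_bytes: usize) -> Vec<&str> {
    // A UTF-8 character is at most 4 bytes, so this guarantees progress.
    assert!(max_bytes >= 4, "max_bytes must fit any UTF-8 character");

    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let mut end = (start + max_bytes).min(text.len());
        // Back off until `end` lands on a character boundary.
        while !text.is_char_boundary(end) {
            end -= 1;
        }
        chunks.push(&text[start..end]);
        start = end;
    }
    chunks
}
```

With a smaller `max_bytes`, each returned slice covers less of the source text, so a match on one sentence pulls in less surrounding material.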

The original default of 4000 bytes was chosen conservatively. Production usage showed that smaller chunks would give more precise retrieval and waste less context.

  1. Large chunks dilute relevance: A 4000-byte chunk may contain one relevant sentence surrounded by mostly irrelevant content
  2. Context waste: Retrieved chunks often contain more text than needed
  3. Embedding quality: Longer text may exceed the embedding model's optimal input length

  1. Search precision: Smaller chunks increase retrieval precision
  2. Context efficiency: Smaller chunks reduce wasted LLM context tokens
  3. Embedding model alignment: 2000 bytes aligns better with typical model token limits

  1. User feedback: Reports of imprecise search results
  2. Empirical testing: Better search quality observed with smaller chunks
  3. Backward compatibility: Existing databases should still work (they keep their chunk size)

Description: Change default from 4000 to 2000 bytes.

Technical Characteristics:

  • ~500-600 tokens per chunk (model-dependent)
  • Fits comfortably in embedding model context
  • Good balance of precision and coherence
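The token figures in this ADR all follow the same ratio of roughly 250 to 300 tokens per 1000 bytes. The sketch below reproduces that arithmetic. The constants and the function name `estimate_token_range` are assumptions introduced for illustration, and real token counts depend on the tokenizer and the text.

```rust
/// Assumed ratio implied by this ADR's estimates (model-dependent).
const TOKENS_PER_1000_BYTES_LOW: usize = 250;
const TOKENS_PER_1000_BYTES_HIGH: usize = 300;

/// Rough (low, high) token estimate for a chunk of `chunk_bytes` bytes.
/// Hypothetical helper; not part of the project's API.
fn estimate_token_range(chunk_bytes: usize) -> (usize, usize) {
    (
        chunk_bytes * TOKENS_PER_1000_BYTES_LOW / 1000,
        chunk_bytes * TOKENS_PER_1000_BYTES_HIGH / 1000,
    )
}
```

Under this assumption, `estimate_token_range(2000)` gives `(500, 600)`, while 4000 and 1000 bytes give `(1000, 1200)` and `(250, 300)` respectively.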

Advantages:

  • More precise search results
  • Less wasted context in retrieved chunks
  • Better embedding quality
  • Still large enough for coherent units

Disadvantages:

  • More chunks per document
  • Slightly more storage overhead
  • May break some semantic units

Risk Assessment:

  • Technical Risk: Low. Simple constant change
  • Schedule Risk: Low. Minimal code change
  • Ecosystem Risk: Low. Backwards compatible

Description: Maintain status quo.

Technical Characteristics:

  • ~1000-1200 tokens per chunk
  • Current behavior preserved

Advantages:

  • No change required
  • Preserves larger semantic units

Disadvantages:

  • Continues precision issues
  • Wastes context on irrelevant content

Risk Assessment:

  • Technical Risk: None. No change
  • Schedule Risk: None. No change
  • Ecosystem Risk: Low. Status quo

Description: More aggressive reduction to 1000 bytes.

Technical Characteristics:

  • ~250-300 tokens per chunk
  • Very fine-grained retrieval

Advantages:

  • Maximum precision
  • Minimal context waste

Disadvantages:

  • May fragment semantic units
  • Many more chunks to manage
  • Higher storage overhead
  • May lose broader context

Risk Assessment:

  • Technical Risk: Medium. May be too aggressive
  • Schedule Risk: Low. Simple change
  • Ecosystem Risk: Low. Backwards compatible

Reduce the default chunk size from 4000 to 2000 bytes.

The implementation will:

  • Change the DEFAULT_CHUNK_SIZE constant from 4000 to 2000
  • Leave existing databases at their original chunk sizes
  • Let users override the default with the --chunk-size flag

  1. Improved precision: Search results contain more focused, relevant content
  2. Better context efficiency: Fewer wasted tokens in LLM context
  3. Embedding alignment: 2000 bytes fits well within embedding model optimal ranges
  4. User satisfaction: Addresses feedback about imprecise results

  1. More chunks: Documents produce roughly twice as many chunks as before
  2. Re-chunking needed: Users who want the new default applied to existing content must reload their documents
  3. Storage increase: Slightly more metadata per document

  1. Backward compatibility: Existing databases continue to work
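The increase in chunk count can be checked with a ceiling division. This sketch assumes plain fixed-size chunking with no overlap between chunks, and `chunk_count` is a hypothetical helper, not a function from the codebase.

```rust
/// Number of fixed-size chunks needed to cover `doc_bytes` bytes,
/// assuming no overlap between chunks (ceiling division).
fn chunk_count(doc_bytes: usize, chunk_size: usize) -> usize {
    assert!(chunk_size > 0, "chunk_size must be positive");
    (doc_bytes + chunk_size - 1) / chunk_size
}
```

A 100,000-byte document needs 25 chunks at 4000 bytes and 50 at 2000 bytes. The ratio is only approximate for small documents: 4500 bytes yields 2 chunks at 4000 and 3 at 2000.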

The 2000-byte default provides a better balance of precision and coherence based on production usage feedback. Users who prefer larger chunks can still use --chunk-size 4000.
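One way the default, the CLI override, and existing databases could interact is sketched below. Only `DEFAULT_CHUNK_SIZE` and the `--chunk-size` flag come from this ADR. The function `resolve_chunk_size`, its parameters, and the precedence order (flag first, then the size stored with an existing database, then the default) are assumptions made for illustration.

```rust
/// Default from this ADR.
const DEFAULT_CHUNK_SIZE: usize = 2000;

/// Pick the effective chunk size.
/// Assumed precedence: explicit --chunk-size value, then the size an
/// existing database was created with, then the compiled-in default.
fn resolve_chunk_size(cli_override: Option<usize>, stored: Option<usize>) -> usize {
    cli_override.or(stored).unwrap_or(DEFAULT_CHUNK_SIZE)
}
```

Under these assumptions, a new database with no flag gets 2000, a database created at 4000 keeps 4000, and passing `--chunk-size 4000` restores the previous behavior for new content.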

Mitigations:

  • Document the change in CHANGELOG
  • Provide migration guidance for users wanting to re-chunk
  • Keep --chunk-size flag for customization

  • Date: 2025-01-18
  • Source: v1.1.2 release based on user feedback
  • Related ADRs: ADR-004, ADR-008

Status: Compliant

Findings:

| Finding | Files | Lines | Assessment |
|---|---|---|---|
| DEFAULT_CHUNK_SIZE = 2000 | src/chunking/mod.rs | - | compliant |
| --chunk-size flag available | src/main.rs | - | compliant |
| CHANGELOG documents change | CHANGELOG.md | v1.1.2 | compliant |

Summary: Default chunk size reduced to 2000 bytes with CLI override available.

Action Required: None

Status: Superseded in practice

Findings:

| Finding | Files | Lines | Assessment |
|---|---|---|---|
| DEFAULT_CHUNK_SIZE = 3000 | src/chunking/mod.rs | - | changed from 2000 |
| MAX_CHUNK_SIZE = 50000 | src/chunking/mod.rs | - | reduced from 250000 |

Summary: Before v1.1.2, the implementation had drifted from this ADR's 2,000-byte target to 240,000. In v1.1.2, the default chunk size was revised to 3,000 characters and the maximum was reduced from 250,000 to 50,000. The CLI --chunk-size override remains available.

Action Required: None. The spirit of this ADR (reducing chunk size for better search precision) remains valid. The exact value (3,000) is documented in the CHANGELOG under v1.1.2.
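The revised constants suggest a bounds check on user-supplied sizes. The sketch below shows one possible form. The two constant values come from the findings above, but the function `validate_chunk_size` is hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumption, since this ADR does not say how the maximum is enforced.

```rust
/// Values reported in the audit findings above.
const DEFAULT_CHUNK_SIZE: usize = 3000;
const MAX_CHUNK_SIZE: usize = 50_000;

/// Validate an optional --chunk-size value.
/// Assumption: out-of-range values are rejected with an error message.
fn validate_chunk_size(requested: Option<usize>) -> Result<usize, String> {
    match requested {
        // No flag given: fall back to the default.
        None => Ok(DEFAULT_CHUNK_SIZE),
        Some(0) => Err("chunk size must be greater than zero".to_string()),
        Some(n) if n > MAX_CHUNK_SIZE => Err(format!(
            "chunk size {} exceeds maximum {}",
            n, MAX_CHUNK_SIZE
        )),
        Some(n) => Ok(n),
    }
}
```

A check like this would have caught a value such as 240,000 before it reached the chunker.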