ADR-009: Reduced Default Chunk Size
Status
Accepted
Context
Background and Problem Statement
The default chunk size affects:
- Search precision: Smaller chunks = more precise retrieval
- Context efficiency: Smaller chunks = less wasted context
- Embedding quality: Embedding models have optimal input sizes
- Chunking overhead: More chunks = more storage and processing
The original default of 4000 bytes was chosen conservatively. After production usage, it became clear that smaller chunks would improve the system.
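To make these trade-offs concrete, a byte-capped chunker can be sketched in Rust. This is a hypothetical illustration only: the splitting logic (a greedy split at the last whitespace before the byte limit) is an assumption for this sketch, not the project's actual implementation in src/chunking/mod.rs.

```rust
// Illustrative byte-capped chunker (assumed logic, not the real one).
const DEFAULT_CHUNK_SIZE: usize = 2000; // bytes, per this ADR

/// Split `text` into chunks of at most `max_bytes`, preferring to break
/// at the last whitespace before the limit so words stay intact.
fn chunk_by_bytes(text: &str, max_bytes: usize) -> Vec<String> {
    assert!(max_bytes > 0);
    let mut chunks = Vec::new();
    let mut rest = text;
    while rest.len() > max_bytes {
        // Back off to a valid UTF-8 char boundary at or before the limit.
        let mut split = max_bytes;
        while !rest.is_char_boundary(split) {
            split -= 1;
        }
        // Prefer the last whitespace before the limit, if any.
        if let Some(ws) = rest[..split].rfind(char::is_whitespace) {
            if ws > 0 {
                split = ws;
            }
        }
        chunks.push(rest[..split].to_string());
        rest = rest[split..].trim_start();
    }
    if !rest.is_empty() {
        chunks.push(rest.to_string());
    }
    chunks
}

fn main() {
    let text = "word ".repeat(1000); // ~5000 bytes
    let chunks = chunk_by_bytes(&text, DEFAULT_CHUNK_SIZE);
    println!("{} chunks", chunks.len()); // 3 chunks
    assert!(chunks.iter().all(|c| c.len() <= DEFAULT_CHUNK_SIZE));
}
```

Halving `DEFAULT_CHUNK_SIZE` roughly doubles the number of chunks produced for the same input, which is the precision-versus-overhead trade-off this ADR weighs.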
Current Limitations
- Large chunks dilute relevance: A 4000-byte chunk may contain one relevant sentence and much irrelevant content
- Context waste: Retrieved chunks often contain more text than needed
- Embedding quality: Longer text may exceed optimal embedding model context
Decision Drivers
Primary Decision Drivers
- Search precision: Smaller chunks increase retrieval precision
- Context efficiency: Smaller chunks reduce wasted LLM context tokens
- Embedding model alignment: 2000 bytes aligns better with typical model token limits
Secondary Decision Drivers
- User feedback: Reports of imprecise search results
- Empirical testing: Better search quality observed with smaller chunks
- Backward compatibility: Existing databases should still work (they keep their chunk size)
Considered Options
Option 1: Reduce to 2000 bytes
Description: Change the default from 4000 to 2000 bytes.
Technical Characteristics:
- ~500-600 tokens per chunk (model-dependent)
- Fits comfortably in embedding model context
- Good balance of precision and coherence
Advantages:
- More precise search results
- Less wasted context in retrieved chunks
- Better embedding quality
- Still large enough for coherent units
Disadvantages:
- More chunks per document
- Slightly more storage overhead
- May break some semantic units
Risk Assessment:
- Technical Risk: Low. Simple constant change
- Schedule Risk: Low. Minimal code change
- Ecosystem Risk: Low. Backwards compatible
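The token figures above follow from the common heuristic of roughly 4 bytes per token for English prose; actual counts are tokenizer-dependent. A minimal sketch of the arithmetic:

```rust
// Rough bytes-to-tokens estimate using the ~4 bytes/token heuristic
// for English text; real counts depend on the tokenizer in use.
fn approx_tokens(bytes: usize) -> usize {
    bytes / 4
}

fn main() {
    println!("2000 bytes ~ {} tokens", approx_tokens(2000)); // ~500
    println!("4000 bytes ~ {} tokens", approx_tokens(4000)); // ~1000
}
```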
Option 2: Keep 4000 bytes
Description: Maintain the status quo.
Technical Characteristics:
- ~1000-1200 tokens per chunk
- Current behavior preserved
Advantages:
- No change required
- Preserves larger semantic units
Disadvantages:
- Continues precision issues
- Wastes context on irrelevant content
Risk Assessment:
- Technical Risk: None. No change
- Schedule Risk: None. No change
- Ecosystem Risk: Low. Status quo
Option 3: Reduce to 1000 bytes
Description: A more aggressive reduction to 1000 bytes.
Technical Characteristics:
- ~250-300 tokens per chunk
- Very fine-grained retrieval
Advantages:
- Maximum precision
- Minimal context waste
Disadvantages:
- May fragment semantic units
- Many more chunks to manage
- Higher storage overhead
- May lose broader context
Risk Assessment:
- Technical Risk: Medium. May be too aggressive
- Schedule Risk: Low. Simple change
- Ecosystem Risk: Low. Backwards compatible
Decision
Reduce the default chunk size from 4000 to 2000 bytes.
The implementation will:
- Change the `DEFAULT_CHUNK_SIZE` constant from 4000 to 2000
- Existing databases retain their original chunk sizes
- Users can override with the `--chunk-size` flag
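The resolution order implied by these points can be sketched as follows. The precedence (an existing database's stored size first, then an explicit flag, then the default) and the function name are assumptions for illustration; the real handling in src/main.rs may differ.

```rust
// Hypothetical sketch of chunk-size resolution; not the actual code.
const DEFAULT_CHUNK_SIZE: usize = 2000; // was 4000 before this ADR

/// An existing database keeps its stored chunk size (backward
/// compatibility); otherwise an explicit --chunk-size value wins,
/// falling back to the new default.
fn resolve_chunk_size(stored: Option<usize>, cli_flag: Option<usize>) -> usize {
    stored.or(cli_flag).unwrap_or(DEFAULT_CHUNK_SIZE)
}

fn main() {
    assert_eq!(resolve_chunk_size(None, None), 2000);             // new DB, new default
    assert_eq!(resolve_chunk_size(None, Some(4000)), 4000);       // explicit override
    assert_eq!(resolve_chunk_size(Some(4000), Some(1000)), 4000); // existing DB retained
    println!("ok");
}
```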
Consequences
Positive
- Improved precision: Search results contain more focused, relevant content
- Better context efficiency: Fewer wasted tokens in the LLM context
- Embedding alignment: 2000 bytes fits well within embedding model optimal ranges
- User satisfaction: Addresses feedback about imprecise results
Negative
- More chunks: Documents produce ~2x more chunks than before
- Re-chunking needed: Users wanting new default must reload documents
- Storage increase: Slightly more metadata per document
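The ~2x figure follows directly from halving the chunk size; a ceiling-division estimate (ignoring boundary trimming) illustrates it:

```rust
// Rough chunk-count estimate: ceiling division of document size by
// chunk size. Real counts vary with boundary handling.
fn approx_chunks(doc_bytes: usize, chunk_size: usize) -> usize {
    (doc_bytes + chunk_size - 1) / chunk_size
}

fn main() {
    let doc = 100_000; // a 100 KB document
    println!("old default (4000): {} chunks", approx_chunks(doc, 4000)); // 25
    println!("new default (2000): {} chunks", approx_chunks(doc, 2000)); // 50
}
```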
Neutral
- Backward compatibility: Existing databases continue to work
Decision Outcome
The 2000-byte default provides a better balance of precision and coherence based on production usage feedback. Users who prefer larger chunks can still use `--chunk-size 4000`.
Mitigations:
- Document the change in CHANGELOG
- Provide migration guidance for users wanting to re-chunk
- Keep the `--chunk-size` flag for customization
Related Decisions
- ADR-004: Multiple Chunking Strategies - Chunking framework
- ADR-008: Hybrid Search - Chunk size affects search quality
- CHANGELOG v1.1.2 - Release notes documenting change
More Information
- Date: 2025-01-18
- Source: v1.1.2 release based on user feedback
- Related ADRs: ADR-004, ADR-008
2025-01-20
Status: Compliant
Findings:
| Finding | Files | Lines | Assessment |
|---|---|---|---|
| DEFAULT_CHUNK_SIZE = 2000 | src/chunking/mod.rs | - | compliant |
| --chunk-size flag available | src/main.rs | - | compliant |
| CHANGELOG documents change | CHANGELOG.md | v1.1.2 | compliant |
Summary: Default chunk size reduced to 2000 bytes with CLI override available.
Action Required: None
2026-01-19
Status: Superseded in practice
Findings:
| Finding | Files | Lines | Assessment |
|---|---|---|---|
| DEFAULT_CHUNK_SIZE = 3000 | src/chunking/mod.rs | - | changed from 2000 |
| MAX_CHUNK_SIZE = 50000 | src/chunking/mod.rs | - | reduced from 250000 |
Summary: In v1.1.2, the default chunk size was revised to 3,000 characters. Before that correction, the implementation had drifted from this ADR's 2,000-byte target to 240,000. The maximum chunk size was also reduced to 50,000 (from 250,000). The CLI --chunk-size override remains available.
Action Required: None. The spirit of this ADR (reduce chunk size for better search precision) remains valid; the exact value (3,000) is documented in the CHANGELOG under v1.1.2.