ADR-009: Reduced Default Chunk Size
Status
Accepted
Context
Background and Problem Statement
The default chunk size affects:
- Search precision: Smaller chunks = more precise retrieval
- Context efficiency: Smaller chunks = less wasted context
- Embedding quality: Embedding models have optimal input sizes
- Chunking overhead: More chunks = more storage and processing
The original default of 4000 bytes was chosen conservatively. After production usage, it became clear that smaller chunks would improve the system.
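To make these trade-offs concrete, a byte-capped chunker can be sketched in Rust. This is a hypothetical illustration only: the splitting logic (a greedy split at the last whitespace before the byte limit) is an assumption for this sketch, not the project's actual implementation in src/chunking/mod.rs.

```rust
// Illustrative byte-capped chunker (assumed logic, not the real one).
const DEFAULT_CHUNK_SIZE: usize = 2000; // bytes, per this ADR

/// Split `text` into chunks of at most `max_bytes`, preferring to break
/// at the last whitespace before the limit so words stay intact.
fn chunk_by_bytes(text: &str, max_bytes: usize) -> Vec<String> {
    assert!(max_bytes > 0);
    let mut chunks = Vec::new();
    let mut rest = text;
    while rest.len() > max_bytes {
        // Back off to a valid UTF-8 char boundary at or before the limit.
        let mut split = max_bytes;
        while !rest.is_char_boundary(split) {
            split -= 1;
        }
        // Prefer the last whitespace before the limit, if any.
        if let Some(ws) = rest[..split].rfind(char::is_whitespace) {
            if ws > 0 {
                split = ws;
            }
        }
        chunks.push(rest[..split].to_string());
        rest = rest[split..].trim_start();
    }
    if !rest.is_empty() {
        chunks.push(rest.to_string());
    }
    chunks
}

fn main() {
    let text = "word ".repeat(1000); // ~5000 bytes
    let chunks = chunk_by_bytes(&text, DEFAULT_CHUNK_SIZE);
    println!("{} chunks", chunks.len()); // 3 chunks
    assert!(chunks.iter().all(|c| c.len() <= DEFAULT_CHUNK_SIZE));
}
```

Halving `DEFAULT_CHUNK_SIZE` roughly doubles the number of chunks produced for the same input, which is the precision-versus-overhead trade-off this ADR weighs.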
Current Limitations
- Large chunks dilute relevance: A 4000-byte chunk may contain one relevant sentence and much irrelevant content
- Context waste: Retrieved chunks often contain more text than needed
- Embedding quality: Longer text may exceed optimal embedding model context
Decision Drivers
Primary Decision Drivers
- Search precision: Smaller chunks increase retrieval precision
- Context efficiency: Smaller chunks reduce wasted LLM context tokens
- Embedding model alignment: 2000 bytes aligns better with typical model token limits
Secondary Decision Drivers
- User feedback: Reports of imprecise search results
- Empirical testing: Better search quality observed with smaller chunks
- Backward compatibility: Existing databases should still work (they keep their chunk size)
Considered Options
Option 1: Reduce to 2000 bytes
Description: Change the default from 4000 to 2000 bytes.
Technical Characteristics:
- ~500-600 tokens per chunk (model-dependent)
- Fits comfortably in embedding model context
- Good balance of precision and coherence
Advantages:
- More precise search results
- Less wasted context in retrieved chunks
- Better embedding quality
- Still large enough for coherent units
Disadvantages:
- More chunks per document
- Slightly more storage overhead
- May break some semantic units
Risk Assessment:
- Technical Risk: Low. Simple constant change
- Schedule Risk: Low. Minimal code change
- Ecosystem Risk: Low. Backwards compatible
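The token figures above follow from the common heuristic of roughly 4 bytes per token for English prose; actual counts are tokenizer-dependent. A minimal sketch of the arithmetic:

```rust
// Rough bytes-to-tokens estimate using the ~4 bytes/token heuristic
// for English text; real counts depend on the tokenizer in use.
fn approx_tokens(bytes: usize) -> usize {
    bytes / 4
}

fn main() {
    println!("2000 bytes ~ {} tokens", approx_tokens(2000)); // ~500
    println!("4000 bytes ~ {} tokens", approx_tokens(4000)); // ~1000
}
```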
Option 2: Keep 4000 bytes
Description: Maintain the status quo.
Technical Characteristics:
- ~1000-1200 tokens per chunk
- Current behavior preserved
Advantages:
- No change required
- Preserves larger semantic units
Disadvantages:
- Continues precision issues
- Wastes context on irrelevant content
Risk Assessment:
- Technical Risk: None. No change
- Schedule Risk: None. No change
- Ecosystem Risk: Low. Status quo
Option 3: Reduce to 1000 bytes
Description: A more aggressive reduction to 1000 bytes.
Technical Characteristics:
- ~250-300 tokens per chunk
- Very fine-grained retrieval
Advantages:
- Maximum precision
- Minimal context waste
Disadvantages:
- May fragment semantic units
- Many more chunks to manage
- Higher storage overhead
- May lose broader context
Risk Assessment:
- Technical Risk: Medium. May be too aggressive
- Schedule Risk: Low. Simple change
- Ecosystem Risk: Low. Backwards compatible
Decision
Reduce the default chunk size from 4000 to 2000 bytes.
The implementation will:
- Change the `DEFAULT_CHUNK_SIZE` constant from 4000 to 2000
- Existing databases retain their original chunk sizes
- Users can override with the `--chunk-size` flag
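The resolution order implied by these points can be sketched as follows. The precedence (an existing database's stored size first, then an explicit flag, then the default) and the function name are assumptions for illustration; the real handling in src/main.rs may differ.

```rust
// Hypothetical sketch of chunk-size resolution; not the actual code.
const DEFAULT_CHUNK_SIZE: usize = 2000; // was 4000 before this ADR

/// An existing database keeps its stored chunk size (backward
/// compatibility); otherwise an explicit --chunk-size value wins,
/// falling back to the new default.
fn resolve_chunk_size(stored: Option<usize>, cli_flag: Option<usize>) -> usize {
    stored.or(cli_flag).unwrap_or(DEFAULT_CHUNK_SIZE)
}

fn main() {
    assert_eq!(resolve_chunk_size(None, None), 2000);             // new DB, new default
    assert_eq!(resolve_chunk_size(None, Some(4000)), 4000);       // explicit override
    assert_eq!(resolve_chunk_size(Some(4000), Some(1000)), 4000); // existing DB retained
    println!("ok");
}
```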
Consequences
Positive
- Improved precision: Search results contain more focused, relevant content
- Better context efficiency: Fewer wasted tokens in the LLM context
- Embedding alignment: 2000 bytes fits well within embedding model optimal ranges
- User satisfaction: Addresses feedback about imprecise results
Negative
- More chunks: Documents produce ~2x more chunks than before
- Re-chunking needed: Users wanting new default must reload documents
- Storage increase: Slightly more metadata per document
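The ~2x figure follows directly from halving the chunk size; a ceiling-division estimate (ignoring boundary trimming) illustrates it:

```rust
// Rough chunk-count estimate: ceiling division of document size by
// chunk size. Real counts vary with boundary handling.
fn approx_chunks(doc_bytes: usize, chunk_size: usize) -> usize {
    (doc_bytes + chunk_size - 1) / chunk_size
}

fn main() {
    let doc = 100_000; // a 100 KB document
    println!("old default (4000): {} chunks", approx_chunks(doc, 4000)); // 25
    println!("new default (2000): {} chunks", approx_chunks(doc, 2000)); // 50
}
```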
Neutral
- Backward compatibility: Existing databases continue to work
Decision Outcome
The 2000-byte default provides a better balance of precision and coherence based on production usage feedback. Users who prefer larger chunks can still use `--chunk-size 4000`.
Mitigations:
- Document the change in CHANGELOG
- Provide migration guidance for users wanting to re-chunk
- Keep the `--chunk-size` flag for customization
Related Decisions
- ADR-004: Multiple Chunking Strategies - Chunking framework
- ADR-008: Hybrid Search - Chunk size affects search quality
- CHANGELOG v1.1.2 - Release notes documenting change
More Information
- Date: 2025-01-18
- Source: v1.1.2 release based on user feedback
- Related ADRs: ADR-004, ADR-008
2025-01-20
Status: Compliant
Findings:
| Finding | Files | Lines | Assessment |
|---|---|---|---|
| DEFAULT_CHUNK_SIZE = 2000 | src/chunking/mod.rs | - | compliant |
| --chunk-size flag available | src/main.rs | - | compliant |
| CHANGELOG documents change | CHANGELOG.md | v1.1.2 | compliant |
Summary: Default chunk size reduced to 2000 bytes with CLI override available.
Action Required: None
2026-01-19
Status: Superseded in practice
Findings:
| Finding | Files | Lines | Assessment |
|---|---|---|---|
| DEFAULT_CHUNK_SIZE = 3000 | src/chunking/mod.rs | - | changed from 2000 |
| MAX_CHUNK_SIZE = 50000 | src/chunking/mod.rs | - | reduced from 250000 |
Summary: In v1.1.2, the default chunk size was revised to 3,000 characters. Before that correction, the implementation had drifted from this ADR's 2,000-byte target to 240,000. The maximum chunk size was also reduced to 50,000 (from 250,000). The CLI --chunk-size override remains available.
Action Required: None. The spirit of this ADR (reduce chunk size for better search precision) remains valid; the exact value (3,000) is documented in the CHANGELOG under v1.1.2.