ADR-004: Multiple Chunking Strategies
Status
Accepted
Context
Background and Problem Statement
The RLM pattern requires breaking large documents into chunks for:
- Embedding generation (models have token limits)
- Semantic search (smaller chunks = more precise retrieval)
- Context assembly (select relevant chunks for LLM input)
Different content types benefit from different chunking approaches:
- Code benefits from syntax-aware chunking
- Prose benefits from paragraph/sentence boundaries
- Structured data may need fixed-size chunks
Current Limitations
- Single strategy: One-size-fits-all chunking loses semantic coherence
- Token limits: Embedding models have maximum input sizes
- Search precision: Large chunks reduce retrieval precision
Decision Drivers
Primary Decision Drivers
- Content-aware chunking: Preserve semantic units (paragraphs, functions, etc.)
- Token budget compliance: Chunks must fit within embedding model limits
- Extensibility: Users should be able to add custom strategies
Secondary Decision Drivers
- Overlap support: Allow overlapping chunks to preserve context at boundaries
- Metadata preservation: Track byte offsets and line numbers for each chunk
- Performance: Chunking should be fast even for large files
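To make the metadata driver concrete, here is a minimal sketch of a chunk record that carries byte offsets and line numbers. The `Chunk` struct, its field names, and the `make_chunk` helper are assumptions for illustration, not the actual rlm-rs types.

```rust
/// Hypothetical chunk record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub text: String,
    pub byte_start: usize, // inclusive byte offset into the source
    pub byte_end: usize,   // exclusive byte offset into the source
    pub line_start: usize, // 1-based line of the first byte
    pub line_end: usize,   // 1-based line of the last byte
}

/// Build a chunk for `source[start..end]`, deriving line numbers by
/// counting newlines. Panics if the offsets are out of range or do not
/// fall on UTF-8 character boundaries.
pub fn make_chunk(source: &str, start: usize, end: usize) -> Chunk {
    // Every newline before `start` pushes the chunk one line down.
    let line_start = source[..start].matches('\n').count() + 1;
    let text = &source[start..end];
    // A trailing newline terminates the chunk's last line, so it does
    // not open a new one.
    let inner_newlines = text
        .strip_suffix('\n')
        .unwrap_or(text)
        .matches('\n')
        .count();
    Chunk {
        text: text.to_string(),
        byte_start: start,
        byte_end: end,
        line_start,
        line_end: line_start + inner_newlines,
    }
}
```

Rescanning the prefix for every chunk is quadratic in the worst case. A chunker that cares about the performance driver would more likely carry a running line counter forward from one chunk to the next.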
Considered Options
Option 1: Pluggable Strategy Pattern
Description: Define a ChunkingStrategy trait with multiple implementations users can select.
Technical Characteristics:
- Trait-based abstraction
- Runtime strategy selection
- Consistent chunk metadata across strategies
Advantages:
- Users choose appropriate strategy per content type
- Easy to add new strategies
- Consistent interface for all strategies
- Strategies can be combined or chained
Disadvantages:
- More code complexity than single implementation
- Users must understand options
Risk Assessment:
- Technical Risk: Low. Strategy pattern is well-understood
- Schedule Risk: Low. Core strategies straightforward
- Ecosystem Risk: Low. No external dependencies
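As a rough illustration of Option 1, the sketch below shows one possible shape for a ChunkingStrategy trait with two implementations and runtime selection through a trait object. The method names, the `(start, end)` byte-range return type, the strategy name strings, and the 512-byte default are all assumptions; the real rlm-rs trait may look different.

```rust
/// Hypothetical trait shape: each strategy returns (start, end) byte ranges.
pub trait ChunkingStrategy {
    fn name(&self) -> &'static str;
    fn chunk(&self, text: &str) -> Vec<(usize, usize)>;
}

/// Splits every `max_bytes` bytes, backing off to a char boundary.
pub struct FixedSize {
    pub max_bytes: usize,
}

impl ChunkingStrategy for FixedSize {
    fn name(&self) -> &'static str {
        "fixed"
    }

    fn chunk(&self, text: &str) -> Vec<(usize, usize)> {
        let mut out = Vec::new();
        let mut start = 0;
        while start < text.len() {
            let mut end = (start + self.max_bytes).min(text.len());
            // Never cut a multi-byte character in half.
            while end > start && !text.is_char_boundary(end) {
                end -= 1;
            }
            if end == start {
                // The budget is smaller than one character: emit that
                // character alone so the loop always makes progress.
                end = start + text[start..].chars().next().map_or(1, |c| c.len_utf8());
            }
            out.push((start, end));
            start = end;
        }
        out
    }
}

/// Splits on blank lines (two consecutive newlines).
pub struct Paragraph;

impl ChunkingStrategy for Paragraph {
    fn name(&self) -> &'static str {
        "paragraph"
    }

    fn chunk(&self, text: &str) -> Vec<(usize, usize)> {
        let mut out = Vec::new();
        let mut start = 0;
        for (idx, _) in text.match_indices("\n\n") {
            if idx > start {
                out.push((start, idx));
            }
            start = idx + 2;
        }
        if start < text.len() {
            out.push((start, text.len()));
        }
        out
    }
}

/// Runtime selection: map a name to a boxed strategy.
pub fn select_strategy(name: &str) -> Option<Box<dyn ChunkingStrategy>> {
    match name {
        "fixed" => Some(Box::new(FixedSize { max_bytes: 512 })),
        "paragraph" => Some(Box::new(Paragraph)),
        _ => None,
    }
}
```

Returning byte ranges rather than owned strings keeps every strategy's output in the same form, which is one way to get consistent chunk metadata across strategies.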
Option 2: Single Adaptive Strategy
Description: One smart chunker that auto-detects content type and adapts.
Technical Characteristics:
- Content type detection
- Heuristic-based splitting
- Single code path
Advantages:
- Simpler API (no strategy selection)
- “Just works” for most cases
Disadvantages:
- Heuristics may fail for edge cases
- Hard to tune for specific needs
- Complex implementation
Disqualifying Factor: Cannot handle all content types well with one algorithm.
Risk Assessment:
- Technical Risk: High. Heuristics are fragile
- Schedule Risk: Medium. Detection logic complex
- Ecosystem Risk: Low. Self-contained
Option 3: External Chunking Library
Description: Use an existing text chunking library.
Technical Characteristics:
- External dependency
- Pre-built strategies
- May have language detection
Advantages:
- Faster initial development
- Battle-tested implementations
Disadvantages:
- Limited Rust options available
- May not match exact requirements
- Additional dependency
Disqualifying Factor: No suitable Rust library with required features existed at project inception.
Risk Assessment:
- Technical Risk: Medium. Dependency quality varies
- Schedule Risk: Low, provided a suitable library exists
- Ecosystem Risk: Medium. Dependency maintenance
Decision
Implement a pluggable chunking system with multiple strategies via a ChunkingStrategy trait.
The implementation will provide:
- FixedSize: Simple byte/character count splitting
- Paragraph: Split on blank lines (markdown/prose)
- Sentence: Split on sentence boundaries
- Sliding Window: Overlapping chunks for context preservation
- Recursive: Tree-sitter or regex-based for code
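Of the strategies listed, the sliding window is the one whose overlap arithmetic is easiest to get wrong, so a minimal sketch may help. The function name, the choice to measure windows in characters, and the parameter names are assumptions for illustration, not the rlm-rs implementation.

```rust
/// Overlapping windows over `text`, measured in characters.
/// `size` is the window length; `overlap` is how many characters two
/// consecutive windows share. Panics unless `overlap < size`.
pub fn sliding_window(text: &str, size: usize, overlap: usize) -> Vec<&str> {
    assert!(overlap < size, "overlap must be smaller than the window size");
    // Byte offset of every character boundary, plus the end of the string,
    // so windows can be sliced without splitting a multi-byte character.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let n_chars = bounds.len() - 1;
    let step = size - overlap; // how far each window advances
    let mut out = Vec::new();
    let mut start = 0;
    while start < n_chars {
        let end = (start + size).min(n_chars);
        out.push(&text[bounds[start]..bounds[end]]);
        if end == n_chars {
            break; // the last window already reaches the end of the text
        }
        start += step;
    }
    out
}
```

With `size = 4` and `overlap = 2`, each window advances by two characters, so the text at every chunk boundary also appears in the neighboring chunk.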
Consequences
Positive
- Content-appropriate chunking: Users select strategy matching their content
- Extensibility: New strategies can be added without changing core code
- Consistent metadata: All strategies produce chunks with offsets and line numbers
- Testability: Each strategy can be tested in isolation
Negative
- User choice required: Users must understand which strategy to use
- More code: Multiple implementations to maintain
- Potential confusion: Too many options can overwhelm
Neutral
- Default strategy: Providing a sensible default mitigates choice paralysis
Decision Outcome
The strategy pattern enables rlm-rs to handle diverse content types effectively. The --strategy CLI flag lets users select the appropriate chunker, with paragraph chunking as a sensible default.
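The flag-plus-default behavior can be sketched with the standard library alone. The `StrategyKind` enum, the accepted value strings, and the hand-rolled argument scan are hypothetical; they are not taken from the rlm-rs source, which may use a CLI parsing crate and different names.

```rust
use std::str::FromStr;

/// Hypothetical strategy selector; Paragraph is the default, matching
/// the decision outcome above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum StrategyKind {
    Fixed,
    #[default]
    Paragraph,
    Sentence,
    SlidingWindow,
    Recursive,
}

impl FromStr for StrategyKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            // The value strings are assumptions for this sketch.
            "fixed" => Ok(Self::Fixed),
            "paragraph" => Ok(Self::Paragraph),
            "sentence" => Ok(Self::Sentence),
            "sliding-window" => Ok(Self::SlidingWindow),
            "recursive" => Ok(Self::Recursive),
            other => Err(format!(
                "unknown strategy '{other}'; expected one of: \
                 fixed, paragraph, sentence, sliding-window, recursive"
            )),
        }
    }
}

/// Find `--strategy <value>` in the arguments, falling back to the default
/// when the flag is absent.
pub fn strategy_from_args(args: &[&str]) -> Result<StrategyKind, String> {
    match args.iter().position(|a| *a == "--strategy") {
        None => Ok(StrategyKind::default()),
        Some(i) => match args.get(i + 1) {
            Some(value) => value.parse(),
            None => Err("--strategy requires a value".to_string()),
        },
    }
}
```

Listing the valid values in the error message means a user who mistypes a strategy name can see the alternatives without opening the documentation.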
Mitigations:
- Good documentation explaining when to use each strategy
- Sensible defaults for common cases
- Clear error messages when chunks exceed limits
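The last mitigation could look roughly like the following: a validation pass that reports which chunk is oversized and by how much. The `ChunkTooLarge` type, the `check_chunks` function, and the use of byte length as a stand-in for a token count are assumptions made for this sketch.

```rust
use std::fmt;

/// Hypothetical error describing the first chunk that exceeds the limit.
#[derive(Debug, PartialEq)]
pub struct ChunkTooLarge {
    pub index: usize,
    pub size: usize,
    pub limit: usize,
}

impl fmt::Display for ChunkTooLarge {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "chunk {} is {} bytes, which exceeds the limit of {} bytes; \
             try a smaller chunk size or a different --strategy",
            self.index, self.size, self.limit
        )
    }
}

impl std::error::Error for ChunkTooLarge {}

/// Return an error for the first chunk longer than `limit` bytes.
/// Byte length is only a proxy here; a real check would count tokens
/// with the embedding model's tokenizer.
pub fn check_chunks(chunks: &[&str], limit: usize) -> Result<(), ChunkTooLarge> {
    for (index, chunk) in chunks.iter().enumerate() {
        if chunk.len() > limit {
            return Err(ChunkTooLarge {
                index,
                size: chunk.len(),
                limit,
            });
        }
    }
    Ok(())
}
```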
Related Decisions
- ADR-001: Adopt RLM Pattern - Chunking is core to RLM
- ADR-009: Reduced Default Chunk Size - Default size tuning
- text-splitter crate - Rust text splitting (evaluated)
More Information
- Date: 2025-01-01
- Source: Project inception design decisions
- Related ADRs: ADR-001, ADR-009
2025-01-20
Status: Compliant
Findings:
| Finding | Files | Lines | Assessment |
|---|---|---|---|
| ChunkingStrategy trait defined | src/chunking/strategy.rs | - | compliant |
| Multiple strategies implemented | src/chunking/ | all | compliant |
| CLI --strategy flag | src/main.rs | - | compliant |
| Chunk metadata tracked | src/chunking/chunk.rs | - | compliant |
Summary: Pluggable chunking system fully implemented with multiple strategies.
Action Required: None