
ADR-004: Multiple Chunking Strategies

Status: Accepted

The RLM pattern requires breaking large documents into chunks for:

  • Embedding generation (models have token limits)
  • Semantic search (smaller chunks = more precise retrieval)
  • Context assembly (select relevant chunks for LLM input)

Different content types benefit from different chunking approaches:

  • Code benefits from syntax-aware chunking
  • Prose benefits from paragraph/sentence boundaries
  • Structured data may need fixed-size chunks

Limitations:

  1. Single strategy: One-size-fits-all chunking loses semantic coherence
  2. Token limits: Embedding models have maximum input sizes
  3. Search precision: Large chunks reduce retrieval precision

Primary decision drivers:

  1. Content-aware chunking: Preserve semantic units (paragraphs, functions, etc.)
  2. Token budget compliance: Chunks must fit within embedding model limits
  3. Extensibility: Users should be able to add custom strategies

Secondary decision drivers:

  1. Overlap support: Allow overlapping chunks to preserve context at boundaries
  2. Metadata preservation: Track byte offsets and line numbers for each chunk
  3. Performance: Chunking should be fast even for large files
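
The metadata driver above (byte offsets and line numbers per chunk) can be illustrated with a minimal sketch. The `Chunk` struct, its field names, and the `make_chunk` helper are assumptions made for illustration; they are not taken from the rlm-rs source.

```rust
/// Hypothetical chunk record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub text: String,
    pub byte_start: usize, // inclusive byte offset into the source
    pub byte_end: usize,   // exclusive byte offset into the source
    pub line_start: usize, // 1-based line of the first byte
    pub line_end: usize,   // 1-based line of the last content byte
}

/// Build a chunk from a byte range of `source`.
/// Returns `None` if the range is empty, out of bounds,
/// or does not fall on UTF-8 character boundaries.
pub fn make_chunk(source: &str, start: usize, end: usize) -> Option<Chunk> {
    if start >= end {
        return None;
    }
    // `str::get` rejects out-of-bounds and mid-character ranges.
    let text = source.get(start..end)?;
    // Lines are counted by newlines that precede the chunk.
    let line_start = 1 + source[..start].bytes().filter(|&b| b == b'\n').count();
    // A trailing newline terminates the last line; it does not start a new one.
    let inner = text.strip_suffix('\n').unwrap_or(text);
    let line_end = line_start + inner.bytes().filter(|&b| b == b'\n').count();
    Some(Chunk {
        text: text.to_string(),
        byte_start: start,
        byte_end: end,
        line_start,
        line_end,
    })
}
```

Because offsets index into the original source, a consumer can map a retrieved chunk back to its location regardless of which strategy produced it.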

Option 1: Pluggable strategies

Description: Define a ChunkingStrategy trait with multiple implementations users can select.

Technical Characteristics:

  • Trait-based abstraction
  • Runtime strategy selection
  • Consistent chunk metadata across strategies

Advantages:

  • Users choose appropriate strategy per content type
  • Easy to add new strategies
  • Consistent interface for all strategies
  • Strategies can be combined or chained

Disadvantages:

  • More code complexity than single implementation
  • Users must understand options

Risk Assessment:

  • Technical Risk: Low. Strategy pattern is well-understood
  • Schedule Risk: Low. Core strategies straightforward
  • Ecosystem Risk: Low. No external dependencies
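
A rough sketch of what a trait-based abstraction with runtime selection could look like follows. The trait name comes from this ADR, but the method signatures, the `WholeDocument` strategy, the `select` function, and the string keys are assumptions; the actual rlm-rs trait may differ.

```rust
use std::ops::Range;

/// Hypothetical trait shape; the real signature may differ.
pub trait ChunkingStrategy {
    fn name(&self) -> &'static str;
    /// Returns byte ranges into `text`, each on UTF-8 character boundaries.
    fn chunk(&self, text: &str) -> Vec<Range<usize>>;
}

/// Splits by byte count, backing off so no character is cut in half.
pub struct FixedSize {
    pub max_bytes: usize,
}

impl ChunkingStrategy for FixedSize {
    fn name(&self) -> &'static str {
        "fixed"
    }

    fn chunk(&self, text: &str) -> Vec<Range<usize>> {
        let mut out = Vec::new();
        let mut start = 0;
        while start < text.len() {
            let mut end = (start + self.max_bytes).min(text.len());
            while end > start && !text.is_char_boundary(end) {
                end -= 1;
            }
            if end == start {
                // The budget is smaller than the next character:
                // emit that one character whole so the loop always advances.
                end = start + text[start..].chars().next().map_or(1, char::len_utf8);
            }
            out.push(start..end);
            start = end;
        }
        out
    }
}

/// Trivial second strategy (illustrative only): one chunk for the whole input.
pub struct WholeDocument;

impl ChunkingStrategy for WholeDocument {
    fn name(&self) -> &'static str {
        "whole"
    }

    fn chunk(&self, text: &str) -> Vec<Range<usize>> {
        if text.is_empty() { Vec::new() } else { vec![0..text.len()] }
    }
}

/// Runtime selection by name; the keys here are assumptions.
pub fn select(name: &str, max_bytes: usize) -> Option<Box<dyn ChunkingStrategy>> {
    match name {
        "fixed" => Some(Box::new(FixedSize { max_bytes })),
        "whole" => Some(Box::new(WholeDocument)),
        _ => None,
    }
}
```

Returning a boxed trait object keeps callers independent of the concrete strategy, which is what makes adding a new one a local change.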

Option 2: Single adaptive chunker

Description: One smart chunker that auto-detects content type and adapts.

Technical Characteristics:

  • Content type detection
  • Heuristic-based splitting
  • Single code path

Advantages:

  • Simpler API (no strategy selection)
  • “Just works” for most cases

Disadvantages:

  • Heuristics may fail for edge cases
  • Hard to tune for specific needs
  • Complex implementation

Disqualifying Factor: Cannot handle all content types well with one algorithm.

Risk Assessment:

  • Technical Risk: High. Heuristics are fragile
  • Schedule Risk: Medium. Detection logic complex
  • Ecosystem Risk: Low. Self-contained
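
To make the fragility concern concrete, here is a deliberately naive detector of the kind a single adaptive chunker would depend on. It is purely hypothetical (the `ContentKind` enum and `detect` function are invented for this illustration) and is not something the project implements.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ContentKind {
    Code,
    Prose,
}

/// Naive heuristic: call the text code when at least half of its
/// non-empty lines end with `;`, `{`, or `}`.
pub fn detect(text: &str) -> ContentKind {
    let lines: Vec<&str> = text
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect();
    let code_like = lines
        .iter()
        .filter(|l| l.ends_with(';') || l.ends_with('{') || l.ends_with('}'))
        .count();
    if !lines.is_empty() && code_like * 2 >= lines.len() {
        ContentKind::Code
    } else {
        ContentKind::Prose
    }
}
```

This classifies brace-delimited code and plain prose correctly, but it labels Python source (`def f(x):` followed by `return x + 1`) as prose, and prose whose lines happen to end in a semicolon as code. Each such edge case would need another rule, which is the tuning burden listed under the disadvantages.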

Option 3: External chunking library

Description: Use an existing text chunking library.

Technical Characteristics:

  • External dependency
  • Pre-built strategies
  • May have language detection

Advantages:

  • Faster initial development
  • Battle-tested implementations

Disadvantages:

  • Limited Rust options available
  • May not match exact requirements
  • Additional dependency

Disqualifying Factor: No suitable Rust library with required features existed at project inception.

Risk Assessment:

  • Technical Risk: Medium. Dependency quality varies
  • Schedule Risk: Low, provided a suitable library exists
  • Ecosystem Risk: Medium. Dependency maintenance

Decision: Implement a pluggable chunking system with multiple strategies via a ChunkingStrategy trait.

The implementation will provide:

  • FixedSize: Simple byte/character count splitting
  • Paragraph: Split on blank lines (markdown/prose)
  • Sentence: Split on sentence boundaries
  • Sliding Window: Overlapping chunks for context preservation
  • Recursive: Tree-sitter or regex-based for code
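
As an illustration of two of the strategies listed above, the following sketch shows paragraph splitting on blank lines and a sliding window with overlap. Function names, parameters, and the choice to measure windows in characters are assumptions for this example, not the rlm-rs implementation.

```rust
use std::ops::Range;

/// Paragraph strategy: byte ranges of runs of non-blank lines.
/// Whitespace-only lines count as blank; line terminators are excluded
/// from the end of each range.
pub fn paragraph_ranges(text: &str) -> Vec<Range<usize>> {
    let mut out = Vec::new();
    let mut current: Option<Range<usize>> = None;
    let mut offset = 0;
    for line in text.split_inclusive('\n') {
        if line.trim().is_empty() {
            // A blank line closes the paragraph in progress, if any.
            if let Some(range) = current.take() {
                out.push(range);
            }
        } else {
            let content_len = line
                .trim_end_matches(|c: char| c == '\n' || c == '\r')
                .len();
            let content_end = offset + content_len;
            current = Some(match current.take() {
                Some(range) => range.start..content_end,
                None => offset..content_end,
            });
        }
        offset += line.len();
    }
    if let Some(range) = current {
        out.push(range);
    }
    out
}

/// Sliding window strategy: windows of `size` characters where consecutive
/// windows share `overlap` characters. Returns byte ranges.
/// Returns an empty list when the window could not advance.
pub fn sliding_window_ranges(text: &str, size: usize, overlap: usize) -> Vec<Range<usize>> {
    if size == 0 || overlap >= size {
        return Vec::new();
    }
    // Byte offset of every character, plus the end of the text.
    let mut bounds: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    bounds.push(text.len());
    let char_count = bounds.len() - 1;
    let step = size - overlap;
    let mut out = Vec::new();
    let mut start = 0;
    while start < char_count {
        let end = (start + size).min(char_count);
        out.push(bounds[start]..bounds[end]);
        if end == char_count {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    out
}
```

With a window of 4 and an overlap of 2, each window repeats the last two characters of its predecessor, so text near a boundary appears in both neighbouring chunks.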

Positive consequences:

  1. Content-appropriate chunking: Users select strategy matching their content
  2. Extensibility: New strategies can be added without changing core code
  3. Consistent metadata: All strategies produce chunks with offsets and line numbers
  4. Testability: Each strategy can be tested in isolation

Negative consequences:

  1. User choice required: Users must understand which strategy to use
  2. More code: Multiple implementations to maintain
  3. Potential confusion: Too many options can overwhelm

Neutral consequences:

  1. Default strategy: Providing a sensible default mitigates choice paralysis

The strategy pattern enables rlm-rs to handle diverse content types effectively. The --strategy CLI flag lets users select the appropriate chunker, with paragraph chunking as a sensible default.
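
The sketch below shows one way a `--strategy` value could map to a strategy with paragraph as the fallback. Only the flag name and the paragraph default come from this ADR; the accepted value strings, the `StrategyKind` enum, and the hand-rolled argument parsing are assumptions, and the real CLI may use an argument-parsing crate instead.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrategyKind {
    FixedSize,
    Paragraph,
    Sentence,
    SlidingWindow,
    Recursive,
}

impl Default for StrategyKind {
    /// Paragraph chunking is the default named in this ADR.
    fn default() -> Self {
        StrategyKind::Paragraph
    }
}

impl FromStr for StrategyKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // The value strings below are hypothetical.
        match s.to_ascii_lowercase().as_str() {
            "fixed" => Ok(StrategyKind::FixedSize),
            "paragraph" => Ok(StrategyKind::Paragraph),
            "sentence" => Ok(StrategyKind::Sentence),
            "sliding-window" => Ok(StrategyKind::SlidingWindow),
            "recursive" => Ok(StrategyKind::Recursive),
            other => Err(format!(
                "unknown strategy '{}'; expected one of: fixed, paragraph, \
                 sentence, sliding-window, recursive",
                other
            )),
        }
    }
}

/// Accepts `--strategy <name>` or `--strategy=<name>`;
/// falls back to the default when the flag is absent.
pub fn strategy_from_args(args: &[&str]) -> Result<StrategyKind, String> {
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        if let Some(value) = arg.strip_prefix("--strategy=") {
            return value.parse();
        }
        if *arg == "--strategy" {
            return match iter.next() {
                Some(value) => value.parse(),
                None => Err("--strategy requires a value".to_string()),
            };
        }
    }
    Ok(StrategyKind::default())
}
```

Listing the valid names in the error message addresses the "users must understand options" concern at the point where a mistake is made.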

Mitigations:

  • Good documentation explaining when to use each strategy
  • Sensible defaults for common cases
  • Clear error messages when chunks exceed limits

More information:

  • Date: 2025-01-01
  • Source: Project inception design decisions
  • Related ADRs: ADR-001, ADR-009
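
The mitigation of clear error messages when chunks exceed limits could take a shape like the following. This is a sketch under stated assumptions: the `ChunkTooLarge` type and `check_budget` function are hypothetical, and the whitespace word count is only a crude stand-in for a real tokenizer, whose counts will differ.

```rust
use std::fmt;

/// Hypothetical error type describing the first chunk over budget.
#[derive(Debug, PartialEq, Eq)]
pub struct ChunkTooLarge {
    pub index: usize,
    pub tokens: usize,
    pub limit: usize,
}

impl fmt::Display for ChunkTooLarge {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "chunk {} has about {} tokens, exceeding the limit of {}; \
             try a smaller chunk size or a different strategy",
            self.index, self.tokens, self.limit
        )
    }
}

impl std::error::Error for ChunkTooLarge {}

/// Crude estimate: whitespace-separated words.
/// A real embedding tokenizer will produce different counts.
pub fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Returns an error for the first chunk whose estimate exceeds `limit`.
pub fn check_budget(chunks: &[&str], limit: usize) -> Result<(), ChunkTooLarge> {
    for (index, chunk) in chunks.iter().enumerate() {
        let tokens = estimate_tokens(chunk);
        if tokens > limit {
            return Err(ChunkTooLarge { index, tokens, limit });
        }
    }
    Ok(())
}
```

Naming the offending chunk, its size, and the limit in one message tells the user both what failed and which setting to change.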

Status: Compliant

Findings:

| Finding | Files | Lines | Assessment |
|---|---|---|---|
| ChunkingStrategy trait defined | src/chunking/strategy.rs | - | compliant |
| Multiple strategies implemented | src/chunking/ | all | compliant |
| CLI --strategy flag | src/main.rs | - | compliant |
| Chunk metadata tracked | src/chunking/chunk.rs | - | compliant |

Summary: Pluggable chunking system fully implemented with multiple strategies.

Action Required: None