
RLM-Inspired Design

This document outlines how rlm-rs builds upon the Recursive Language Model (RLM) research paper while extending it for practical use in AI-assisted software development.

The Recursive Language Model (RLM) pattern, introduced in arXiv:2512.24601 by Zhang, Kraska, and Khattab (MIT CSAIL), addresses a fundamental limitation of large language models: fixed context windows.

Core Insight: Rather than trying to fit everything into a single context window, decompose large tasks into smaller subtasks processed by sub-LLMs, with a root LLM orchestrating the overall workflow.

  1. Hierarchical Decomposition: Break large documents into manageable chunks
  2. Recursive Processing: Sub-LLMs process chunks independently
  3. State Externalization: Persist intermediate results outside the LLM context
  4. Result Aggregation: Synthesize sub-results into coherent final output
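
The four steps above can be sketched as a toy loop, with string splitting standing in for chunking and a closure standing in for a sub-LLM call (both are placeholders for illustration, not the paper's actual model calls):

```rust
/// Toy RLM loop: decompose, process each chunk with a "sub-LLM",
/// hold intermediate results outside any context window, then aggregate.
fn rlm_process(document: &str, sub_llm: impl Fn(&str) -> String) -> String {
    // 1. Hierarchical decomposition: split into paragraph chunks.
    let chunks: Vec<&str> = document.split("\n\n").collect();
    // 2. Recursive processing + 3. state externalization:
    //    each chunk is handled independently; results live in a Vec.
    let intermediate: Vec<String> = chunks.iter().map(|c| sub_llm(c)).collect();
    // 4. Result aggregation: synthesize into one output.
    intermediate.join("; ")
}

fn main() {
    let doc = "alpha details here\n\nbeta details here";
    // Stand-in sub-LLM: "summarize" a chunk by keeping its first word.
    let summary = rlm_process(doc, |c| {
        c.split_whitespace().next().unwrap_or("").to_string()
    });
    assert_eq!(summary, "alpha; beta");
}
```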

rlm-cli takes the RLM paper’s theoretical foundation and translates it into a practical CLI tool optimized for AI-assisted coding workflows. Key extensions include:

| RLM Paper Concept | rlm-rs Implementation | Extension |
| --- | --- | --- |
| Document chunking | Semantic, fixed, parallel strategies | Content-aware boundaries |
| State persistence | SQLite with transactions | Schema versioning, reliability |
| Sub-LLM invocation | Pass-by-reference via chunk IDs | Zero-copy retrieval |
| Result aggregation | Buffer storage for intermediate results | Named buffers, variables |
| Similarity search | Hybrid semantic + BM25 with RRF | Multi-signal ranking |

1. Pass-by-Reference Chunk Retrieval

Instead of copying chunk content into prompts, rlm-rs uses chunk IDs that subagents can dereference:

```sh
# Root agent searches for relevant chunks
rlm-cli search "authentication errors" --format json | jq '.results[].chunk_id'
# Returns: 42, 17, 89

# Subagent retrieves a specific chunk by ID
rlm-cli chunk get 42
# Returns: full chunk content
```

Benefits:

  • Reduces context usage in orchestration layer
  • Enables parallel subagent processing
  • Maintains single source of truth in SQLite
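
The pass-by-reference idea can be illustrated with an in-memory stand-in for the SQLite chunk table: the orchestrator circulates only IDs, and a subagent dereferences an ID when it actually needs the bytes (a toy sketch, not the rlm-rs storage API):

```rust
use std::collections::HashMap;

/// Dereference a chunk ID against the store, borrowing the content
/// (zero-copy) instead of duplicating it into every prompt.
fn get_chunk(store: &HashMap<u64, String>, id: u64) -> Option<&str> {
    store.get(&id).map(String::as_str)
}

fn main() {
    // Stand-in for the SQLite chunk table: ID -> content.
    let mut store = HashMap::new();
    store.insert(42u64, "Error: invalid credentials for user".to_string());
    store.insert(17u64, "retrying authentication after 401".to_string());

    // The orchestrator passes only IDs around: tiny in context, cheap to copy.
    let relevant_ids = [42u64, 17];
    for id in relevant_ids {
        // A subagent dereferences the ID only when it needs the content.
        let content = get_chunk(&store, id).expect("chunk exists");
        println!("chunk {}: {}", id, content);
    }
    // Missing IDs are a lookup miss, not stale copied text.
    assert!(get_chunk(&store, 99).is_none());
}
```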

2. Hybrid Search with Reciprocal Rank Fusion


The paper focuses on semantic similarity alone. rlm-rs combines multiple retrieval signals: dense semantic embeddings and BM25 keyword scores, fused with Reciprocal Rank Fusion (RRF).

Why RRF? Semantic search excels at conceptual similarity; BM25 excels at exact keyword matching. Combining them handles both “what does this mean?” and “where is this term?” queries.
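
RRF combines the two ranked lists by summing 1/(k + rank) for each document across lists; a stdlib-only sketch of the fusion step (k = 60 is the conventional constant, and the chunk IDs are illustrative):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
/// where rank is 1-based within each list.
fn rrf_fuse(lists: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (rank, &id) in list.iter().enumerate() {
            *scores.entry(id).or_insert(0.0) += 1.0 / (k + (rank as f64 + 1.0));
        }
    }
    // Sort descending by fused score.
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // One ranking from semantic search, one from BM25, over chunk IDs.
    let semantic = vec![42, 17, 89];
    let bm25 = vec![89, 42, 7];
    let fused = rrf_fuse(&[semantic, bm25], 60.0);
    // Chunk 42 ranks high in both lists, so it wins the fusion.
    assert_eq!(fused[0].0, 42);
}
```

Because RRF only consumes ranks, not raw scores, it sidesteps the problem of normalizing cosine similarities against BM25 scores, which live on entirely different scales.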

3. Content-Aware Chunking Strategies

The paper treats chunking as a preprocessing step. rlm-rs makes it a first-class concern:

| Strategy | Algorithm | Best For |
| --- | --- | --- |
| Semantic | Unicode sentence/paragraph boundaries | Markdown, code, prose |
| Fixed | Character boundaries with UTF-8 safety | Logs, raw text |
| Parallel | Rayon-parallelized fixed chunking | Large files (>10MB) |

```sh
# Semantic chunking preserves natural boundaries
rlm-cli load README.md --chunker semantic

# Fixed chunking for uniform sizes
rlm-cli load server.log --chunker fixed --chunk-size 50000

# Parallel chunking for speed on large files
rlm-cli load huge-dump.txt --chunker parallel
```
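
The "UTF-8 safety" in the fixed strategy means a chunk boundary must never split a multi-byte character; a minimal stdlib-only sketch of that idea (not the actual rlm-rs implementation):

```rust
/// Split `text` into chunks of at most `max_bytes` bytes, backing off
/// so no chunk boundary falls inside a multi-byte UTF-8 sequence.
fn fixed_chunks(text: &str, max_bytes: usize) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let mut end = (start + max_bytes).min(text.len());
        // str::is_char_boundary finds a safe split point; slicing
        // mid-character would panic.
        while !text.is_char_boundary(end) {
            end -= 1;
        }
        chunks.push(&text[start..end]);
        start = end;
    }
    chunks
}

fn main() {
    // 'é' is 2 bytes in UTF-8; a naive byte split at offset 3 would panic.
    let chunks = fixed_chunks("abécd", 3);
    assert_eq!(chunks, vec!["ab", "éc", "d"]);
    // Nothing is lost: the chunks reassemble into the original text.
    assert_eq!(chunks.concat(), "abécd");
}
```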

Embeddings are generated automatically during document ingestion:

```sh
rlm-cli load document.md --name docs
# Output: Loaded document.md as 'docs' (15 chunks, embeddings generated)
```

This eliminates a separate embedding step and ensures search is always available.


| Layer | Purpose | Key Types |
| --- | --- | --- |
| CLI | Parse args, dispatch commands, format output | `Cli`, `Commands`, `OutputFormat` |
| Core | Domain models with business logic | `Buffer`, `Chunk`, `Context` |
| Chunking | Split content into processable units | `Chunker` trait, strategies |
| Search | Find relevant chunks for queries | `SearchConfig`, RRF fusion |
| Storage | Persist state across sessions | `Storage` trait, SQLite |
| I/O | Efficient file operations | Memory-mapped reads, UTF-8 |
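
The table names a `Chunker` trait behind the strategies; a minimal sketch of how such a strategy trait might be shaped (the method signature and `ParagraphChunker` here are illustrative, not the actual rlm-rs API):

```rust
/// Illustrative strategy trait: each chunker turns a document into chunks.
trait Chunker {
    fn chunk<'a>(&self, text: &'a str) -> Vec<&'a str>;
}

/// Semantic-style chunker: split on blank lines (paragraph boundaries).
struct ParagraphChunker;

impl Chunker for ParagraphChunker {
    fn chunk<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split("\n\n").filter(|p| !p.trim().is_empty()).collect()
    }
}

fn main() {
    // Strategies are interchangeable behind a trait object, so the CLI
    // can select one from a flag like --chunker without branching logic.
    let chunker: Box<dyn Chunker> = Box::new(ParagraphChunker);
    let chunks = chunker.chunk("# Title\n\nFirst paragraph.\n\nSecond paragraph.");
    assert_eq!(chunks.len(), 3);
}
```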

rlm-cli is designed as a command-line tool that any AI assistant can invoke via shell:

```sh
# Any AI assistant can run these commands
rlm-cli init
rlm-cli load document.md --name docs
rlm-cli search "error handling" --format json
rlm-cli chunk get 42
```

This means rlm-rs works with:

  • Claude Code (via Bash tool)
  • GitHub Copilot (via terminal)
  • Codex CLI (via shell execution)
  • OpenCode (via command execution)
  • Any tool that can run shell commands

All commands support --format json for programmatic consumption:

```sh
rlm-cli --format json search "authentication" --top-k 5
```

```json
{
  "count": 3,
  "mode": "hybrid",
  "query": "authentication",
  "results": [
    {"chunk_id": 42, "score": 0.0328, "semantic_score": 0.0499, "bm25_score": 1.6e-6},
    {"chunk_id": 17, "score": 0.0323, "semantic_score": 0.0457, "bm25_score": 1.2e-6}
  ]
}
```

rlm-cli is a single static binary with everything embedded:

  • Embedding model (BGE-M3 via fastembed, 1024 dimensions)
  • SQLite (via rusqlite)
  • Full-text search (FTS5)

No Python, no external services, no API keys required.

All state mutations use SQLite transactions:

```rust
// Pseudocode from the storage layer (helper names are illustrative)
fn add_buffer(&mut self, buffer: &Buffer) -> Result<i64> {
    let tx = self.conn.transaction()?;
    let buffer_id = insert_buffer_row(&tx, buffer)?; // insert buffer metadata
    insert_chunks(&tx, buffer_id, buffer)?;          // insert chunks
    insert_embeddings(&tx, buffer_id)?;              // generate embeddings
    tx.commit()?; // all-or-nothing: any error before commit rolls back
    Ok(buffer_id)
}
```

| Aspect | RLM Paper | rlm-rs |
| --- | --- | --- |
| Primary Use Case | General long-context tasks | Code analysis & development |
| State Management | Abstract "external environment" | Concrete SQLite database |
| Retrieval | Semantic similarity | Hybrid semantic + BM25 |
| Chunking | Fixed-size | Content-aware strategies |
| Integration | Research prototype | Production CLI tool |
| Embedding | External service | Embedded model (offline) |
| Output | Unspecified | Text + JSON formats |

Modern LLMs have context windows of 100K-200K tokens, but:

  • Large codebases exceed this easily
  • Full context = slower inference + higher cost
  • Irrelevant context degrades response quality

Solution: Intelligent Chunking + Retrieval

  1. Load once: Chunk and embed documents upfront
  2. Search smart: Find only relevant chunks for each query
  3. Process targeted: Subagents work on specific chunks
  4. Synthesize: Root agent combines results

A 10MB codebase (~2.5M tokens) can be:

  • Chunked into ~800 chunks of roughly 3K tokens each
  • Searched to find top 10 relevant chunks
  • Processed by subagents in parallel
  • Synthesized into coherent analysis

Building on the RLM foundation, planned extensions include:

  1. Streaming Processing: Process chunks as they’re generated
  2. Incremental Updates: Re-embed only changed content
  3. Cross-Buffer Search: Find patterns across multiple documents
  4. Agent Memory: Persistent learning from previous analyses
  5. Distributed Processing: Parallel subagent execution

  • Zhang, X., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv:2512.24601
  • claude_code_RLM - Python implementation that inspired this project
  • fastembed - Rust embedding library
  • rusqlite - SQLite bindings for Rust