
RLM-Inspired Design

This document outlines how rlm-rs builds upon the Recursive Language Model (RLM) research paper while extending it for practical use in AI-assisted software development.

The Recursive Language Model (RLM) pattern, introduced in arXiv:2512.24601 by Zhang, Kraska, and Khattab (MIT CSAIL), addresses a fundamental limitation of large language models: fixed context windows.

Core Insight: Rather than trying to fit everything into a single context window, decompose large tasks into smaller subtasks processed by sub-LLMs, with a root LLM orchestrating the overall workflow.

  1. Hierarchical Decomposition: Break large documents into manageable chunks
  2. Recursive Processing: Sub-LLMs process chunks independently
  3. State Externalization: Persist intermediate results outside the LLM context
  4. Result Aggregation: Synthesize sub-results into coherent final output
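
The four steps above can be sketched as a toy loop, with string splitting standing in for chunking and a closure standing in for a sub-LLM call (both are placeholders for illustration, not the paper's actual model calls):

```rust
/// Toy RLM loop: decompose, process each chunk with a "sub-LLM",
/// hold intermediate results outside any context window, then aggregate.
fn rlm_process(document: &str, sub_llm: impl Fn(&str) -> String) -> String {
    // 1. Hierarchical decomposition: split into paragraph chunks.
    let chunks: Vec<&str> = document.split("\n\n").collect();
    // 2. Recursive processing + 3. state externalization:
    //    each chunk is handled independently; results live in a Vec.
    let intermediate: Vec<String> = chunks.iter().map(|c| sub_llm(c)).collect();
    // 4. Result aggregation: synthesize into one output.
    intermediate.join("; ")
}

fn main() {
    let doc = "alpha details here\n\nbeta details here";
    // Stand-in sub-LLM: "summarize" a chunk by keeping its first word.
    let summary = rlm_process(doc, |c| {
        c.split_whitespace().next().unwrap_or("").to_string()
    });
    assert_eq!(summary, "alpha; beta");
}
```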

rlm-cli takes the RLM paper’s theoretical foundation and translates it into a practical CLI tool optimized for AI-assisted coding workflows. Key extensions include:

| RLM Paper Concept | rlm-rs Implementation | Extension |
| --- | --- | --- |
| Document chunking | Semantic, fixed, parallel strategies | Content-aware boundaries |
| State persistence | SQLite with transactions | Schema versioning, reliability |
| Sub-LLM invocation | Pass-by-reference via chunk IDs | Zero-copy retrieval |
| Result aggregation | Buffer storage for intermediate results | Named buffers, variables |
| Similarity search | Hybrid semantic + BM25 with RRF | Multi-signal ranking |

1. Pass-by-Reference Chunk Retrieval

Instead of copying chunk content into prompts, rlm-rs uses chunk IDs that subagents can dereference:

```sh
# Root agent searches for relevant chunks
rlm-cli search "authentication errors" --format json | jq '.results[].chunk_id'
# Returns: 42, 17, 89

# Subagent retrieves a specific chunk by ID
rlm-cli chunk get 42
# Returns: full chunk content
```

Benefits:

  • Reduces context usage in orchestration layer
  • Enables parallel subagent processing
  • Maintains single source of truth in SQLite
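
The pass-by-reference idea can be illustrated with an in-memory stand-in for the SQLite chunk table: the orchestrator circulates only IDs, and a subagent dereferences an ID when it actually needs the bytes (a toy sketch, not the rlm-rs storage API):

```rust
use std::collections::HashMap;

/// Dereference a chunk ID against the store, borrowing the content
/// (zero-copy) instead of duplicating it into every prompt.
fn get_chunk(store: &HashMap<u64, String>, id: u64) -> Option<&str> {
    store.get(&id).map(String::as_str)
}

fn main() {
    // Stand-in for the SQLite chunk table: ID -> content.
    let mut store = HashMap::new();
    store.insert(42u64, "Error: invalid credentials for user".to_string());
    store.insert(17u64, "retrying authentication after 401".to_string());

    // The orchestrator passes only IDs around: tiny in context, cheap to copy.
    let relevant_ids = [42u64, 17];
    for id in relevant_ids {
        // A subagent dereferences the ID only when it needs the content.
        let content = get_chunk(&store, id).expect("chunk exists");
        println!("chunk {}: {}", id, content);
    }
    // Missing IDs are a lookup miss, not stale copied text.
    assert!(get_chunk(&store, 99).is_none());
}
```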

2. Hybrid Search with Reciprocal Rank Fusion


The paper focuses on semantic similarity alone. rlm-rs combines multiple retrieval signals: dense semantic embeddings and BM25 keyword scores, fused with Reciprocal Rank Fusion (RRF).

Why RRF? Semantic search excels at conceptual similarity; BM25 excels at exact keyword matching. Combining them handles both “what does this mean?” and “where is this term?” queries.
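
RRF combines the two ranked lists by summing 1/(k + rank) for each document across lists; a stdlib-only sketch of the fusion step (k = 60 is the conventional constant, and the chunk IDs are illustrative):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
/// where rank is 1-based within each list.
fn rrf_fuse(lists: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (rank, &id) in list.iter().enumerate() {
            *scores.entry(id).or_insert(0.0) += 1.0 / (k + (rank as f64 + 1.0));
        }
    }
    // Sort descending by fused score.
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // One ranking from semantic search, one from BM25, over chunk IDs.
    let semantic = vec![42, 17, 89];
    let bm25 = vec![89, 42, 7];
    let fused = rrf_fuse(&[semantic, bm25], 60.0);
    // Chunk 42 ranks high in both lists, so it wins the fusion.
    assert_eq!(fused[0].0, 42);
}
```

Because RRF only consumes ranks, not raw scores, it sidesteps the problem of normalizing cosine similarities against BM25 scores, which live on entirely different scales.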

3. Content-Aware Chunking Strategies

The paper treats chunking as a preprocessing step. rlm-rs makes it a first-class concern:

| Strategy | Algorithm | Best For |
| --- | --- | --- |
| Semantic | Unicode sentence/paragraph boundaries | Markdown, code, prose |
| Fixed | Character boundaries with UTF-8 safety | Logs, raw text |
| Parallel | Rayon-parallelized fixed chunking | Large files (>10MB) |

```sh
# Semantic chunking preserves natural boundaries
rlm-cli load README.md --chunker semantic

# Fixed chunking for uniform sizes
rlm-cli load server.log --chunker fixed --chunk-size 50000

# Parallel chunking for speed on large files
rlm-cli load huge-dump.txt --chunker parallel
```
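
The "UTF-8 safety" in the fixed strategy means a chunk boundary must never split a multi-byte character; a minimal stdlib-only sketch of that idea (not the actual rlm-rs implementation):

```rust
/// Split `text` into chunks of at most `max_bytes` bytes, backing off
/// so no chunk boundary falls inside a multi-byte UTF-8 sequence.
fn fixed_chunks(text: &str, max_bytes: usize) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let mut end = (start + max_bytes).min(text.len());
        // str::is_char_boundary finds a safe split point; slicing
        // mid-character would panic.
        while !text.is_char_boundary(end) {
            end -= 1;
        }
        chunks.push(&text[start..end]);
        start = end;
    }
    chunks
}

fn main() {
    // 'é' is 2 bytes in UTF-8; a naive byte split at offset 3 would panic.
    let chunks = fixed_chunks("abécd", 3);
    assert_eq!(chunks, vec!["ab", "éc", "d"]);
    // Nothing is lost: the chunks reassemble into the original text.
    assert_eq!(chunks.concat(), "abécd");
}
```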

Embeddings are generated automatically during document ingestion:

```sh
rlm-cli load document.md --name docs
# Output: Loaded document.md as 'docs' (15 chunks, embeddings generated)
```

This eliminates a separate embedding step and ensures search is always available.


| Layer | Purpose | Key Types |
| --- | --- | --- |
| CLI | Parse args, dispatch commands, format output | `Cli`, `Commands`, `OutputFormat` |
| Core | Domain models with business logic | `Buffer`, `Chunk`, `Context` |
| Chunking | Split content into processable units | `Chunker` trait, strategies |
| Search | Find relevant chunks for queries | `SearchConfig`, RRF fusion |
| Storage | Persist state across sessions | `Storage` trait, SQLite |
| I/O | Efficient file operations | Memory-mapped reads, UTF-8 |
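
The table names a `Chunker` trait behind the strategies; a minimal sketch of how such a strategy trait might be shaped (the method signature and `ParagraphChunker` here are illustrative, not the actual rlm-rs API):

```rust
/// Illustrative strategy trait: each chunker turns a document into chunks.
trait Chunker {
    fn chunk<'a>(&self, text: &'a str) -> Vec<&'a str>;
}

/// Semantic-style chunker: split on blank lines (paragraph boundaries).
struct ParagraphChunker;

impl Chunker for ParagraphChunker {
    fn chunk<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split("\n\n").filter(|p| !p.trim().is_empty()).collect()
    }
}

fn main() {
    // Strategies are interchangeable behind a trait object, so the CLI
    // can select one from a flag like --chunker without branching logic.
    let chunker: Box<dyn Chunker> = Box::new(ParagraphChunker);
    let chunks = chunker.chunk("# Title\n\nFirst paragraph.\n\nSecond paragraph.");
    assert_eq!(chunks.len(), 3);
}
```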

rlm-cli is designed as a command-line tool that any AI assistant can invoke via shell:

```sh
# Any AI assistant can run these commands
rlm-cli init
rlm-cli load document.md --name docs
rlm-cli search "error handling" --format json
rlm-cli chunk get 42
```

This means rlm-rs works with:

  • Claude Code (via Bash tool)
  • GitHub Copilot (via terminal)
  • Codex CLI (via shell execution)
  • OpenCode (via command execution)
  • Any tool that can run shell commands

All commands support --format json for programmatic consumption:

```sh
rlm-cli --format json search "authentication" --top-k 5
```

```json
{
  "count": 3,
  "mode": "hybrid",
  "query": "authentication",
  "results": [
    {"chunk_id": 42, "score": 0.0328, "semantic_score": 0.0499, "bm25_score": 1.6e-6},
    {"chunk_id": 17, "score": 0.0323, "semantic_score": 0.0457, "bm25_score": 1.2e-6}
  ]
}
```

rlm-cli is a single static binary with everything embedded:

  • Embedding model (BGE-M3 via fastembed, 1024 dimensions)
  • SQLite (via rusqlite)
  • Full-text search (FTS5)

No Python, no external services, no API keys required.

All state mutations use SQLite transactions:

```rust
// Pseudocode from the storage layer (helper names are illustrative)
fn add_buffer(&mut self, buffer: &Buffer) -> Result<i64> {
    let tx = self.conn.transaction()?;
    let buffer_id = insert_buffer_row(&tx, buffer)?; // insert buffer metadata
    insert_chunks(&tx, buffer_id, buffer)?;          // insert chunks
    insert_embeddings(&tx, buffer_id)?;              // generate embeddings
    tx.commit()?; // all-or-nothing: any error before commit rolls back
    Ok(buffer_id)
}
```

| Aspect | RLM Paper | rlm-rs |
| --- | --- | --- |
| Primary Use Case | General long-context tasks | Code analysis & development |
| State Management | Abstract "external environment" | Concrete SQLite database |
| Retrieval | Semantic similarity | Hybrid semantic + BM25 |
| Chunking | Fixed-size | Content-aware strategies |
| Integration | Research prototype | Production CLI tool |
| Embedding | External service | Embedded model (offline) |
| Output | Unspecified | Text + JSON formats |

Modern LLMs have context windows of 100K-200K tokens, but:

  • Large codebases exceed this easily
  • Full context = slower inference + higher cost
  • Irrelevant context degrades response quality

Solution: Intelligent Chunking + Retrieval

  1. Load once: Chunk and embed documents upfront
  2. Search smart: Find only relevant chunks for each query
  3. Process targeted: Subagents work on specific chunks
  4. Synthesize: Root agent combines results

A 10MB codebase (~2.5M tokens) can be:

  • Chunked into ~800 chunks of roughly 3K tokens each
  • Searched to find top 10 relevant chunks
  • Processed by subagents in parallel
  • Synthesized into coherent analysis

Building on the RLM foundation, planned extensions include:

  1. Streaming Processing: Process chunks as they’re generated
  2. Incremental Updates: Re-embed only changed content
  3. Cross-Buffer Search: Find patterns across multiple documents
  4. Agent Memory: Persistent learning from previous analyses
  5. Distributed Processing: Parallel subagent execution

  • Zhang, X., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv:2512.24601
  • claude_code_RLM - Python implementation that inspired this project
  • fastembed - Rust embedding library
  • rusqlite - SQLite bindings for Rust