Large Result Offloading: Stop Stuffing Tool Outputs into Context Windows
Your agent calls recall_memories and gets back 247 records. That’s 38,000 tokens of structured data crammed into context, half the window gone, so the model can look at maybe five of them.
This is the context window overflow problem, and it gets worse as agents gain more tools. Every MCP integration adds schema overhead and result payloads. The more capable your agent becomes, the more context it wastes on raw data it never reads.
Large Result Offloading (LRO) is a pattern I formalized to address this. The core idea is simple: don’t inject large tool results into context. Materialize them to a file and give the agent a compact descriptor instead.
The problem in three parts
You pay for tokens the model ignores. If a tool returns n records and the agent uses k of them where k is much smaller than n, you burn (n - k) / n of that cost for nothing.
Attention degrades over long sequences. The “lost in the middle” phenomenon means information buried in a wall of tool output gets systematically less attention. You’re not just wasting tokens. You’re actively degrading reasoning quality.
Data displaces thinking. Every token consumed by tool output is a token unavailable for reasoning, planning, and response generation. Those are the operations that actually matter.
How LRO works
When a tool invocation produces a result set whose estimated token count exceeds a threshold (default: 6,400 tokens), the system does two things.
First, it writes the full result set to a JSONL file, one record per line, with a header line containing metadata.
Second, it returns a compact descriptor to the agent containing:
- Summary statistics (record count, token estimate, namespace distribution, score range)
- A JSON Schema for each record line
- A library of 10 jq extraction queries covering common access patterns
The agent gets an 800-token descriptor instead of a 38,000-token data dump. When it needs specific records, it runs an extraction query against the file. The data stays intact. The agent pulls what it needs, when it needs it.
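As a concrete illustration, a descriptor might look like the following. The field names and jq snippets here are assumptions for the sketch, not the normative schema from the spec; the numbers match the running example (247 records at ~155 tokens each):

```json
{
  "type": "lro_descriptor",
  "file": "recall_results.jsonl",
  "record_count": 247,
  "token_estimate": 38285,
  "namespaces": {"project": 120, "personal": 87, "system": 40},
  "score_range": [0.41, 0.97],
  "record_schema": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "namespace": {"type": "string"},
      "score": {"type": "number"}
    }
  },
  "queries": [
    {"pattern": "ranking", "jq": "sort_by(-.score) | .[:5]"},
    {"pattern": "filtering", "jq": "map(select(.namespace == \"project\"))"}
  ]
}
```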
Operation executes -> estimate tokens -> compare against threshold:

- Under threshold: return inline (conventional path)
- Over threshold: materialize to JSONL -> return compact descriptor
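The branch above is a few lines of code. This is a minimal sketch, not the atlatl implementation: the 4-characters-per-token estimate and the descriptor field names are my assumptions, and a real system would use a proper tokenizer.

```python
import json

TOKEN_THRESHOLD = 6400  # default threshold from the text

def estimate_tokens(records):
    # Crude heuristic: roughly 4 characters per token for serialized JSON.
    return sum(len(json.dumps(r)) for r in records) // 4

def offload_if_large(records, path, queries):
    """Return small results inline; materialize large ones and return a descriptor."""
    est = estimate_tokens(records)
    if est <= TOKEN_THRESHOLD:
        return {"type": "inline", "records": records}  # conventional path
    with open(path, "w") as f:
        # Header line with metadata, then one record per line (JSONL).
        f.write(json.dumps({"record_count": len(records), "token_estimate": est}) + "\n")
        for r in records:
            f.write(json.dumps(r) + "\n")
    return {
        "type": "descriptor",
        "file": path,
        "record_count": len(records),
        "token_estimate": est,
        "queries": queries,  # the pre-computed jq extraction library
    }
```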
The descriptor is the interface
The key insight is not the file offloading. Anyone can write results to a file. The contribution is the compact descriptor as a typed contract between the tool subsystem and the reasoning subsystem.
The extraction query library is the critical piece. Instead of handing the agent a file path and hoping it figures out how to query it, LRO provides pre-computed queries that encode domain knowledge about common access patterns:
| Pattern | What it does |
|---|---|
| Enumeration | Tabular listing of titles with confidence scores |
| Filtering | Namespace-based subset selection |
| Search | Case-insensitive keyword matching across records |
| Aggregation | Group-by distribution analysis |
| Ranking | Top-k extraction by score |
All queries use jq against the JSONL file, composable via Unix pipelines. The agent picks a query pattern, runs it, and gets a focused subset. No need to understand the underlying data format.
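For illustration, the five patterns have straightforward equivalents in plain Python over the same JSONL file. The spec's actual queries are jq; the field names (`title`, `namespace`, `score`) are assumptions carried over from the descriptor sketch:

```python
import json

def load_records(path):
    """Read a JSONL result file, skipping the metadata header on line one."""
    with open(path) as f:
        lines = f.read().splitlines()
    return [json.loads(line) for line in lines[1:]]

# Filtering: namespace-based subset selection
def filter_by_namespace(records, ns):
    return [r for r in records if r.get("namespace") == ns]

# Search: case-insensitive keyword matching
def keyword_search(records, kw):
    return [r for r in records if kw.lower() in r.get("title", "").lower()]

# Aggregation: group-by distribution analysis
def namespace_distribution(records):
    dist = {}
    for r in records:
        dist[r.get("namespace")] = dist.get(r.get("namespace"), 0) + 1
    return dist

# Ranking: top-k extraction by score
def top_k_by_score(records, k):
    return sorted(records, key=lambda r: -r.get("score", 0))[:k]
```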
What this is not
Not compression. LLMLingua, ACON, Selective Context: these trade fidelity for space. LRO preserves the complete result set. Nothing is lost. Delivery is deferred.
Not generic memory management. MemGPT provides the agent with general-purpose memory primitives and lets it figure out retrieval strategy. LRO provides result-specific extraction queries tailored to a particular tool’s output schema. The descriptor includes the queries. The agent does not need to formulate a retrieval strategy.
Not hierarchical retrieval. A-RAG optimizes how you find information across a corpus. LRO optimizes how the agent consumes a specific result set already retrieved.
The savings math
For a result set of n records at average token cost t per record, with the agent extracting k records and the descriptor costing d tokens:
savings = 1 - (d + k * t) / (n * t)
With representative numbers (247 records, 155 tokens each, 5 extracted, 800-token descriptor):
savings = 1 - (800 + 5 * 155) / (247 * 155) = 96%
That 96% is parametric. It depends on the access ratio k/n. If your agent inspects most records, savings drop. If it does targeted lookups in large result sets, savings go higher. The pattern wins whenever the working set is a small fraction of the total. In practice with memory recall operations, it almost always is.
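The formula is easy to check in code, including the break-even behavior when the agent reads most of the result set:

```python
def savings(n, t, k, d):
    """Token savings fraction: n records of t tokens each,
    k records actually extracted, descriptor costing d tokens."""
    return 1 - (d + k * t) / (n * t)

# Representative numbers from the text: 247 records, 155 tokens each,
# 5 extracted, 800-token descriptor.
print(round(savings(n=247, t=155, k=5, d=800), 3))  # prints 0.959

# If the agent reads everything (k = n), the descriptor is pure overhead
# and "savings" go negative.
```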
A database cursor for LLM context
The closest analogy is database cursor management. When a SQL query produces a result set exceeding client buffer capacity, the database materializes server-side and provides a cursor for demand-driven access. In LRO, the extraction query library is the cursor and the context window is the client buffer.
Li et al. made this connection explicit in their VLDB paper on database perspectives for LLM inference. Decades of database research on buffer management and result spooling directly inform LLM system design. LRO is a concrete instantiation of that observation.
Current status
LRO is specified as an extension in the atlatl specification, an implementation framework for the Memory Interchange Format (MIF). MIF is at version 0.1.0-draft. This is early-stage work, not an adopted standard.
The full specification includes configuration schema, conformance levels, and the complete extraction query library. An accompanying academic paper with formal definitions, related work analysis, and compression-based comparisons is in progress and will be published separately.
What’s next
LRO is a design contribution. The savings formula is parametric, not empirical. The work that needs doing:
Ablation studies. What happens when agents get the descriptor without extraction queries? How sensitive is performance to threshold selection?
Benchmarks. Does LRO maintain task accuracy compared to full inline injection, truncation, and compression baselines?
Agent behavior logging. Do agents actually use the extraction queries? How often? Do they compose compound queries?
Latency measurements. Does the materialization overhead pay for itself in reduced inference time?
If you build tool-augmented agent systems and run into context overflow, the pattern is straightforward to implement independent of MIF or atlatl. The core mechanism, threshold-gated materialization with a guided descriptor, applies to any high-cardinality tool output.
The source repository will be published on GitHub once the specification stabilizes. Feedback, critique, and independent implementations are welcome.