Your persistent memory server has a search tool. It works. The model calls it, gets back ranked results with scores and metadata, browses through them, decides what matters. Search is an exploration tool, and exploration tools should return rich, complete data.

Then the model wants to inject relevant memories into its working context. It calls the same search tool, gets back 3000 tokens of results it already evaluated, and now has to manually extract, reformat, and truncate. Half the results don’t fit. The model burns tokens re-processing data it already saw. The context window fills with raw search output instead of usable knowledge.

This is the wrong tool for the job. Search and context injection solve different problems. Collapsing them into one tool creates something bad at both.

Two tools, two interaction patterns

Search is for exploration. The model browses results, reads metadata, checks relevance scores, and makes decisions about what to do next. A good search tool returns everything: full content, confidence scores, source references, timestamps. The model needs that data to reason about which memories matter.

Context injection is for prompt enrichment. The model says “give me what I need to know about deployment failures, fitted into 1500 tokens.” It gets back formatted text, budget-clamped and ready to use. No browsing. No evaluation. The model trusts the tool to prioritize and format.

These patterns differ in three ways:

Who decides relevance. Search returns candidates and lets the model decide. Context injection makes the relevance decision internally, ranking and filtering before the model sees anything.

Token accountability. Search returns whatever it finds, unbounded. Context injection operates under a strict budget. If you ask for 2000 tokens, you get at most 2000 tokens. The tool owns that constraint, not the model.

Output shape. Search returns structured records with scores and metadata for programmatic consumption. Context injection returns formatted prose or structured text ready to paste into a prompt.

You could add flags to a search tool: --format=prompt --max-tokens=2000 --skip-metadata. But now you have one tool with two conflicting responsibilities and a parameter list that grows every quarter. Two tools with clear jobs beat one tool with twelve flags.

Why a Tool, not a Prompt or Resource

The MCP specification (2025-03-26) defines three server primitives for exposing capabilities. Tools are model-controlled: the LLM decides when to call them during reasoning, invokes them programmatically, gets computed results back. Prompts are user-controlled: a human selects a template, the client expands it. Resources are application-controlled: the host decides which data to expose and when, typically static or semi-static content like files and configuration.

Context injection belongs as a Tool because the model decides when it needs memory context. A coding assistant debugging an error might call it mid-reasoning: “I need context on past debugging sessions related to this stack trace, fitted into 1500 tokens.” That decision happens inside the model’s reasoning loop, not from a user action or application lifecycle event.

The tool also performs real computation. It searches, ranks, estimates token costs, selects detail levels, formats output. This is not static data retrieval. A Resource would need to know the query in advance. A Prompt would need the user to trigger it. Neither fits.
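Concretely, the tool's interface can be declared as an MCP tool definition: a name, a description, and a JSON Schema for its input. This is a sketch; the name `inject_context` and the exact parameter set are illustrative, though they mirror the parameters used in the examples later in this post.

```python
# Hypothetical MCP tool definition for context injection. The shape
# (name / description / inputSchema) follows the MCP tool-definition
# format; the specific names and defaults are illustrative.
CONTEXT_INJECTION_TOOL = {
    "name": "inject_context",
    "description": (
        "Return formatted memory context fitted to a token budget. "
        "Searches by query, or formats pre-fetched hits directly."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "What the model needs context about.",
            },
            "max_tokens": {
                "type": "integer",
                "description": "Hard upper bound on returned tokens.",
                "default": 1500,
            },
            "hits": {
                "type": "array",
                "description": "Optional pre-fetched results to format "
                               "instead of re-searching.",
                "items": {
                    "type": "object",
                    "properties": {
                        "id": {"type": "string"},
                        "score": {"type": "number"},
                    },
                    "required": ["id"],
                },
            },
        },
    },
}
```

Declaring the budget in the schema makes the contract visible to the model at tool-discovery time, before it ever calls the tool.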

Token budgeting with cascading detail levels

This is the core pattern. The caller specifies a token budget, and the tool fills it intelligently using cascading detail levels.

Every memory can be rendered at three levels of detail:

Full: complete content with metadata, tags, related entities, and source references. Might run 400-800 tokens for a substantial memory.

Medium: content with key metadata. Drops auxiliary fields. Typically 150-300 tokens.

Light: title and a one-line summary. Usually 30-60 tokens.
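A renderer for these three levels might look like the following sketch. The field names mirror the search-result shape shown later in this post; the exact formatting is illustrative, not prescriptive.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    title: str
    content: str
    namespace: str
    score: float
    tags: list = field(default_factory=list)
    related_entities: list = field(default_factory=list)
    source_ref: str = ""

def render(memory: Memory, detail: str) -> str:
    header = f"**{memory.title}**"
    if detail == "full":
        # Everything: content, tags, related entities, source reference.
        lines = [
            f"{header} ({memory.namespace}, score: {memory.score:.2f})",
            memory.content,
            f"Tags: {', '.join(memory.tags)}",
        ]
        if memory.related_entities:
            lines.append(f"Related: {', '.join(memory.related_entities)}")
        if memory.source_ref:
            lines.append(f"Source: {memory.source_ref}")
        return "\n".join(lines)
    if detail == "medium":
        # Content plus key metadata; auxiliary fields dropped.
        return (f"{header} ({memory.namespace}, "
                f"score: {memory.score:.2f})\n{memory.content}")
    # Light: title plus a one-line summary (here, just the first sentence).
    summary = memory.content.split(". ")[0].rstrip(".") + "."
    return f"{header}\n{summary}"
```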

The algorithm walks through results in relevance order:

given: ranked memories, token budget
remaining = budget

for each memory in rank order:
    full_rendering = render(memory, detail="full")
    if estimate_tokens(full_rendering) <= remaining:
        emit full_rendering
        remaining -= estimate_tokens(full_rendering)
        continue

    medium_rendering = render(memory, detail="medium")
    if estimate_tokens(medium_rendering) <= remaining:
        emit medium_rendering
        remaining -= estimate_tokens(medium_rendering)
        continue

    light_rendering = render(memory, detail="light")
    if estimate_tokens(light_rendering) <= remaining:
        emit light_rendering
        remaining -= estimate_tokens(light_rendering)
        continue

    stop  // even light doesn't fit, budget exhausted

Token estimation uses a conservative ceil(characters / 4) heuristic. This slightly overestimates, which is the right direction. Underestimating risks blowing the budget. Overestimating wastes a small amount of space but keeps the contract honest.
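Putting the cascade and the estimator together, a minimal runnable sketch (each memory is represented here as a dict of pre-rendered strings, standing in for `render(memory, detail)`; separator tokens between sections are ignored for brevity):

```python
import math

def estimate_tokens(text: str) -> int:
    # Conservative heuristic: round up at ~4 characters per token.
    # Overestimating wastes a little space; underestimating breaks the budget.
    return math.ceil(len(text) / 4)

def fill_budget(ranked_renderings: list, budget: int) -> str:
    """Greedy cascade: try full, then medium, then light for each memory
    in relevance order; stop once even a light rendering won't fit.

    ranked_renderings: list of {"full": str, "medium": str, "light": str},
    already sorted by relevance.
    """
    parts, remaining = [], budget
    for levels in ranked_renderings:
        for detail in ("full", "medium", "light"):
            rendering = levels[detail]
            cost = estimate_tokens(rendering)
            if cost <= remaining:
                parts.append(rendering)
                remaining -= cost
                break  # this memory fits; move on to the next one
        else:
            break  # even light doesn't fit: budget exhausted, stop
    return "\n\n".join(parts)
```

The greedy walk is deliberately simple: no backtracking, no swapping a later full rendering for an earlier medium one. Relevance order decides who gets the budget first.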

The cascade produces a natural information gradient. Your top-ranked memory gets full detail with complete context. The next few get medium detail. Lower-ranked memories get one-line summaries. The least relevant memories get dropped entirely. The model gets the most important information at the highest fidelity, and everything fits inside the budget.

Before and after

Here is what raw search returns for a query about “database migration failures” (approximately 3200 tokens, truncated for illustration):

{
  "results": [
    {
      "id": "a1b2c3",
      "score": 0.94,
      "title": "PostgreSQL migration timeout on large tables",
      "content": "Migration on the orders table failed after 45 minutes...",
      "namespace": "incidents",
      "tags": ["postgresql", "migration", "timeout"],
      "created": "2026-02-14T09:30:00Z",
      "related_entities": ["orders_table", "pg_migration_v3"],
      "source_ref": "session-2026-02-14"
    },
    {
      "id": "d4e5f6",
      "score": 0.87,
      "title": "Batch size fix for migration timeouts",
      "content": "Reducing batch size from 10000 to 2000 rows resolved...",
      "namespace": "knowledge",
      "tags": ["postgresql", "migration", "performance"],
      "created": "2026-02-15T14:22:00Z",
      "related_entities": ["orders_table"],
      "source_ref": "session-2026-02-15"
    },
    {
      "id": "g7h8i9",
      "score": 0.71,
      "title": "Decision: use pt-online-schema-change for large tables",
      "content": "After three failed attempts at direct ALTER TABLE...",
      "namespace": "decisions",
      "tags": ["postgresql", "schema-change", "tooling"],
      "created": "2026-02-18T11:00:00Z",
      "related_entities": [],
      "source_ref": "session-2026-02-18"
    }
  ]
}

Here is what context injection returns for the same query with a 500-token budget:

## Relevant Context: database migration failures

**PostgreSQL migration timeout on large tables** (incidents, score: 0.94)
Migration on the orders table failed after 45 minutes due to lock
contention. The table had 12M rows and the ALTER TABLE acquired an
ACCESS EXCLUSIVE lock. Connection pool exhausted at minute 38. Resolved
by splitting into batched operations.
Tags: postgresql, migration, timeout

**Batch size fix for migration timeouts** (knowledge, score: 0.87)
Reducing batch size from 10000 to 2000 rows resolved timeout issues.

**Decision: use pt-online-schema-change for large tables** (decisions)
Adopted pt-online-schema-change after three failed direct ALTER attempts.

The first result got full detail because it ranked highest and the budget allowed it. The second got medium detail. The third got light detail. Total: approximately 480 tokens. The model gets actionable context without re-processing raw search output.

Avoiding the double-search problem

A common workflow creates a subtle inefficiency. The model searches memories, evaluates the results, identifies the relevant ones, then wants those memories injected as formatted context. The naive implementation of context injection runs its own internal search, duplicating work the model already did.

The fix: accept pre-fetched hit identifiers alongside scores. The caller passes a list of memory IDs with their relevance scores, and the tool loads those specific memories by ID instead of re-searching. The search and injection tools stay composable without redundant queries.

# Instead of re-searching:
context_injection:
  query: "database migration failures"
  max_tokens: 1500

# Pass results the model already found:
context_injection:
  hits:
    - id: "a1b2c3"
      score: 0.94
    - id: "d4e5f6"
      score: 0.87
    - id: "g7h8i9"
      score: 0.71
  max_tokens: 1500

Both paths produce the same output format. The second path skips the search entirely and jumps straight to ranking, rendering, and budget allocation. In my setup this saves roughly 40-60ms of redundant search latency per call, which adds up in multi-turn conversations where the model injects context repeatedly.
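The dispatch itself is small. In this sketch, `search` and `load_by_id` are hypothetical stand-ins for the server's internals; the point is that both paths converge on identical (memory, score) pairs before ranking and budgeting begin.

```python
from typing import Callable, Optional

def resolve_hits(
    query: Optional[str],
    hits: Optional[list],
    search: Callable,      # search(query) -> list of (memory, score) pairs
    load_by_id: Callable,  # load_by_id(id) -> memory
) -> list:
    """Return (memory, score) pairs from either path, so the downstream
    ranking, rendering, and budget allocation are identical."""
    if hits:
        # The model already searched and evaluated: trust its IDs and
        # scores, and skip the redundant query entirely.
        pairs = [(load_by_id(h["id"]), h.get("score", 0.0)) for h in hits]
    else:
        pairs = list(search(query))
    # Both paths converge on the same relevance ordering.
    return sorted(pairs, key=lambda p: p[1], reverse=True)
```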

Sensitivity filtering and access control

Injected context goes directly into LLM prompts. Whatever the tool returns influences the model’s reasoning and output. This makes access control more consequential than it is for exploratory search.

The pattern uses sensitivity levels on individual memories. Confidential memories are always excluded from injection output; no override, no flag to flip. The tool filters them before rendering begins, so they never appear in the budget allocation loop. Restricted memories are excluded by default but can be included with an explicit parameter, giving callers a deliberate opt-in for sensitive-but-needed context.

Scope-based access control gates the tool itself. Before executing any query or loading any pre-fetched hits, the tool checks whether the caller has a memory-read scope. Without it, the call fails before touching the memory store.

This layered approach means confidential data cannot leak into prompts through an injection call, even if the caller has broad search permissions. Search can return confidential results (with appropriate access) for human review. Injection cannot, because its output feeds directly into model reasoning with no human in the loop.
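The two guards can be sketched as follows, assuming each memory carries a `sensitivity` field (`public`, `restricted`, or `confidential`) and the caller carries a set of granted scopes. The names are illustrative; the layering is the point.

```python
class AccessDenied(Exception):
    pass

def check_scope(granted_scopes: set) -> None:
    # Gate the whole tool: fail before touching the memory store.
    if "memory-read" not in granted_scopes:
        raise AccessDenied("context injection requires the memory-read scope")

def filter_for_injection(memories: list, include_restricted: bool = False) -> list:
    """Drop sensitive memories before rendering, so they never enter
    the budget allocation loop."""
    allowed = []
    for m in memories:
        level = m.get("sensitivity", "public")
        if level == "confidential":
            continue  # never injectable: no override, no flag to flip
        if level == "restricted" and not include_restricted:
            continue  # excluded unless the caller explicitly opts in
        allowed.append(m)
    return allowed
```

Filtering before rendering, rather than after, matters: a confidential memory that merely influences budget allocation would still leak information about its existence.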

A note on evidence

I have not run benchmarks against this pattern. No A/B tests, no controlled measurements of retrieval quality or token efficiency across workloads. What I have is observation from daily use: the model reaches for context injection more often when it already has a clear picture of the query, usually from prior turns in the conversation where it ran a search and evaluated the results. In those cases, injection feels like the right tool. The model knows what it wants, it just needs the memories formatted and fitted.

Whether this holds up under rigorous measurement is an open question. The cascading detail algorithm is a greedy heuristic, not an optimality proof. The token estimation is a rough approximation. The whole pattern is a “feels right” design informed by building and using it daily, not by published research or controlled experiments. I am sharing it because the separation of concerns and the budgeting approach are useful ideas regardless of whether my specific implementation is optimal. Treat the numbers in this post as illustrative, not empirical.

Where this pattern pays off

A coding assistant encounters a stack trace it has seen before. Instead of searching memories and manually extracting the relevant debugging history, it calls context injection with the error signature and a 1500-token budget. It gets back a formatted summary of past incidents, root causes, and fixes, all fitted to the budget. The assistant applies that knowledge to the current error without burning half its context window on raw search results.

A planning agent needs decision history before proposing an architecture change. It injects context about past architectural decisions in the relevant namespace, budgeted to 2000 tokens. The top decisions get full rationale and tradeoff analysis. Lower-ranked decisions get one-line summaries. The agent plans with historical awareness without drowning in a complete decision log.

A multi-agent system runs three specialists: frontend, backend, and infrastructure. Each specialist gets domain-specific memory injected at task start, filtered by namespace, budgeted independently. The frontend agent gets 1000 tokens of UI decisions and component patterns. The backend agent gets 1500 tokens of API design history and service boundaries. Each agent starts work with relevant context at the right detail level, fitted to its role.

If you are building MCP servers that expose memory or knowledge systems, check whether your search tool is pulling double duty as a context injector. If it is, split them. The search tool gets simpler. The injection tool gets a real token budget.

The cascading detail algorithm is what makes context injection more than “search with a token limit.” Flat truncation wastes budget on low-relevance results rendered at full detail. Cascading detail spends every token on the highest-value information first, then degrades gracefully. The model gets the best possible context summary for any given budget, and you stop worrying about whether raw search output will blow past the context window.