ADR-007: Embedded Embedding Model

Status: Accepted

Context

Semantic search requires converting text into embedding vectors that capture meaning, which in turn requires an embedding model. The question is whether to:

  1. Call an external API (OpenAI, Cohere, etc.)
  2. Run a local model server
  3. Embed the model directly in the binary

This decision affects offline capability, privacy, latency, and distribution complexity.

Constraints to avoid:

  1. API dependencies: External APIs require network access and API keys, and incur per-call costs
  2. Server processes: Local servers add deployment complexity
  3. Privacy concerns: Sending text to external services may leak sensitive data

Requirements:

  1. Offline capability: Must work without network connectivity
  2. Privacy: No data should leave the user’s machine
  3. Zero configuration: Should work out-of-the-box without API keys

Secondary benefits of local inference:

  1. Latency: Local inference is faster than API calls
  2. Cost: No per-token charges
  3. Reliability: No dependency on external service availability

Option 1: Embedded Model (fastembed-rs)

Description: Use fastembed-rs to run ONNX models directly within the Rust binary.

Technical Characteristics:

  • ONNX Runtime for inference
  • Model downloaded on first use
  • Lazy loading to minimize cold start
  • Thread-safe singleton pattern
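
A minimal sketch of the lazy, thread-safe singleton, assuming the fastembed 4.x API (`TextEmbedding`, `InitOptions`, `EmbeddingModel`; names and signatures vary across versions) and a small BGE variant standing in for the model the project actually ships:

```rust
use std::sync::OnceLock;

use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

// Process-wide, lazily initialised model. OnceLock guarantees the
// initialisation closure runs at most once, even under concurrent access.
static EMBEDDER: OnceLock<TextEmbedding> = OnceLock::new();

fn embedder() -> &'static TextEmbedding {
    EMBEDDER.get_or_init(|| {
        // The first call downloads the ONNX model if it is not cached;
        // later calls reuse the cached files and this in-memory instance.
        // BGESmallENV15 is illustrative; the ADR's BGE-M3 would be
        // selected the same way if the fastembed version provides it.
        TextEmbedding::try_new(
            InitOptions::new(EmbeddingModel::BGESmallENV15)
                .with_show_download_progress(true),
        )
        .expect("failed to initialise embedding model")
    })
}

fn embed_texts(texts: Vec<String>) -> Vec<Vec<f32>> {
    embedder()
        .embed(texts, None) // None => fastembed's default batch size
        .expect("embedding inference failed")
}
```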

Advantages:

  • Fully offline operation
  • No API keys or configuration
  • Fast local inference
  • Privacy preserved (no data leaves machine)
  • Consistent results (no API version drift)

Disadvantages:

  • Large binary size (ONNX runtime)
  • Initial model download required
  • CPU-only (no GPU acceleration in default build)
  • Memory overhead for model

Risk Assessment:

  • Technical Risk: Low. fastembed-rs is production-ready
  • Schedule Risk: Low. Drop-in integration
  • Ecosystem Risk: Low. ONNX is industry standard

Option 2: External Embedding API

Description: Call OpenAI, Cohere, or a similar API for embeddings.

Technical Characteristics:

  • HTTP client for API calls
  • API key management
  • Rate limiting and retries
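
For comparison, this rejected approach would look roughly like the sketch below, which calls OpenAI's public /v1/embeddings endpoint via reqwest and serde; the model name is illustrative, and retries and rate limiting are elided:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct EmbeddingResponse {
    data: Vec<EmbeddingData>,
}

#[derive(Deserialize)]
struct EmbeddingData {
    embedding: Vec<f32>,
}

fn embed_remote(text: &str, api_key: &str) -> Result<Vec<f32>, reqwest::Error> {
    // Requires reqwest with the "blocking" and "json" features enabled.
    let resp: EmbeddingResponse = reqwest::blocking::Client::new()
        .post("https://api.openai.com/v1/embeddings")
        .bearer_auth(api_key)
        .json(&serde_json::json!({
            "model": "text-embedding-3-small",
            "input": text,
        }))
        .send()?
        .error_for_status()?
        .json()?;

    // One input => one embedding row in the response.
    Ok(resp.data.into_iter().next().map(|d| d.embedding).unwrap_or_default())
}
```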

Advantages:

  • Smaller binary size
  • Access to latest models
  • GPU inference on server side

Disadvantages:

  • Requires network connectivity
  • API key management
  • Cost per token
  • Privacy concerns
  • Latency from network round-trips

Disqualifying Factor: Network dependency and privacy concerns conflict with offline-first CLI design.

Risk Assessment:

  • Technical Risk: Low. APIs are well-documented
  • Schedule Risk: Low. Simple HTTP client
  • Ecosystem Risk: Medium. API changes, pricing changes

Option 3: Local Model Server (Ollama, etc.)

Description: Require users to run a local embedding server.

Technical Characteristics:

  • HTTP client to localhost
  • External process management
  • Model management in server
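
This rejected option would amount to a localhost HTTP call, sketched below against an Ollama-style /api/embeddings endpoint; the route, payload, and model name are assumptions that differ across servers and versions:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct OllamaEmbeddingResponse {
    embedding: Vec<f32>,
}

fn embed_via_local_server(text: &str) -> Result<Vec<f32>, reqwest::Error> {
    // Assumes an Ollama server on its default port with an embedding
    // model already pulled; the user must install and run it separately.
    let resp: OllamaEmbeddingResponse = reqwest::blocking::Client::new()
        .post("http://localhost:11434/api/embeddings")
        .json(&serde_json::json!({
            "model": "nomic-embed-text",
            "prompt": text,
        }))
        .send()?
        .error_for_status()?
        .json()?;
    Ok(resp.embedding)
}
```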

Advantages:

  • Offloads inference to dedicated process
  • Potentially GPU acceleration
  • Model updates independent of rlm-rs

Disadvantages:

  • Additional installation step
  • Process management complexity
  • Port conflicts possible

Disqualifying Factor: Requiring a separate server conflicts with single-binary distribution goal.

Risk Assessment:

  • Technical Risk: Low. HTTP is simple
  • Schedule Risk: Medium. Documentation/setup guides needed
  • Ecosystem Risk: Medium. Server version compatibility

Decision

Embed the embedding model using fastembed-rs with the ONNX Runtime.

The implementation will use:

  • fastembed-rs for model management and inference
  • ONNX Runtime as the inference backend
  • Lazy loading to defer model download until first use
  • Thread-safe singleton for model instance sharing
  • Feature flag (fastembed-embeddings) to make embeddings optional
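
One hypothetical shape for that feature gate is compile-time dispatch via `#[cfg(feature = ...)]`. The `Embedder` trait and type names below are invented for illustration, with stubs standing in for the real modules (src/embedding/fastembed_impl.rs and src/embedding/fallback.rs):

```rust
/// Minimal embedding interface; rlm-rs's real trait is likely richer.
pub trait Embedder: Send + Sync {
    fn embed(&self, text: &str) -> Vec<f32>;
}

// Stubs standing in for the real implementations; names are illustrative.
pub struct FastEmbedder;
pub struct HashEmbedder;

impl Embedder for FastEmbedder {
    fn embed(&self, _text: &str) -> Vec<f32> {
        unimplemented!("ONNX-backed inference lives behind the feature flag")
    }
}

impl Embedder for HashEmbedder {
    fn embed(&self, _text: &str) -> Vec<f32> {
        vec![0.0; 384] // placeholder; see the hash-based sketch below
    }
}

// With the feature on, callers get the model-backed embedder; with it
// off, the ONNX runtime is not compiled in at all and the cheap
// fallback keeps semantic-search call sites building.
#[cfg(feature = "fastembed-embeddings")]
pub fn default_embedder() -> Box<dyn Embedder> {
    Box::new(FastEmbedder)
}

#[cfg(not(feature = "fastembed-embeddings"))]
pub fn default_embedder() -> Box<dyn Embedder> {
    Box::new(HashEmbedder)
}
```
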
Positive consequences:

  1. Offline operation: Works without network after the initial model download
  2. Privacy: No data leaves the user’s machine
  3. Zero configuration: No API keys or server setup required
  4. Consistent: Same model version produces reproducible results
  5. Fast: Local inference avoids network latency

Negative consequences:

  1. Binary size: ONNX runtime adds to binary size
  2. Initial download: First embedding operation downloads the model (~1.3GB for BGE-M3)
  3. Memory usage: Model loaded in memory during operation
  4. CPU-only: No GPU acceleration without a custom build

Neutral consequences:

  1. Feature flag: Embeddings can be disabled for smaller builds

Embedded embeddings via fastembed-rs enable rlm-rs to provide semantic search without external dependencies. The lazy loading pattern minimizes cold start impact for operations that don’t need embeddings.

Mitigations:

  • Lazy model loading to preserve cold start for non-embedding operations
  • Feature flag for builds that don’t need semantic search
  • Fallback embedder (hash-based) when the feature is disabled (see the sketch below)
  • Clear messaging during model download
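
A minimal sketch of a hash-based fallback, assuming a simple feature-hashing scheme (hash each token into a fixed-size vector, then L2-normalise); the actual src/embedding/fallback.rs implementation may differ:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const DIM: usize = 384;

fn hash_embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; DIM];
    for token in text.split_whitespace() {
        let mut h = DefaultHasher::new();
        token.to_lowercase().hash(&mut h);
        let x = h.finish();
        // Low bits pick the bucket; the top bit picks a sign, which
        // reduces systematic bias from hash collisions.
        let idx = (x as usize) % DIM;
        let sign = if x & (1 << 63) == 0 { 1.0 } else { -1.0 };
        v[idx] += sign;
    }
    // L2-normalise so dot products behave like cosine similarity.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}
```
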
Metadata:

  • Date: 2025-01-15
  • Source: v1.0.0 release design decisions
  • Related ADRs: ADR-001, ADR-008, ADR-010

Compliance Review

Status: Compliant

Findings:

| Finding | Files | Lines | Assessment |
| --- | --- | --- | --- |
| fastembed-rs integration | src/embedding/fastembed_impl.rs | all | compliant |
| Lazy loading singleton | src/embedding/fastembed_impl.rs | L14, L59-80 | compliant |
| Feature flag configured | Cargo.toml | L17-18 | compliant |
| Fallback embedder available | src/embedding/fallback.rs | all | compliant |

Summary: Embedded embedding model fully implemented with lazy loading and fallback.

Action Required: None