ADR-007: Embedded Embedding Model

Status: Accepted

Context

Semantic search requires converting text into embedding vectors that capture meaning, which in turn requires an embedding model. The question is whether to:

  1. Call an external API (OpenAI, Cohere, etc.)
  2. Run a local model server
  3. Embed the model directly in the binary

This decision affects offline capability, privacy, latency, and distribution complexity.

Constraints to avoid:

  1. API dependencies: External APIs require network access and API keys, and incur per-call costs
  2. Server processes: Local servers add deployment complexity
  3. Privacy concerns: Sending text to external services may leak sensitive data

Requirements:

  1. Offline capability: Must work without network connectivity
  2. Privacy: No data should leave the user’s machine
  3. Zero configuration: Should work out-of-the-box without API keys

Secondary benefits of local inference:

  1. Latency: Local inference is faster than API calls
  2. Cost: No per-token charges
  3. Reliability: No dependency on external service availability

Option 1: Embedded Model (fastembed-rs)

Description: Use fastembed-rs to run ONNX models directly within the Rust binary.

Technical Characteristics:

  • ONNX Runtime for inference
  • Model downloaded on first use
  • Lazy loading to minimize cold start
  • Thread-safe singleton pattern
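
A minimal sketch of the lazy, thread-safe singleton, assuming the fastembed 4.x API (`TextEmbedding`, `InitOptions`, `EmbeddingModel`; names and signatures vary across versions) and a small BGE variant standing in for the model the project actually ships:

```rust
use std::sync::OnceLock;

use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

// Process-wide, lazily initialised model. OnceLock guarantees the
// initialisation closure runs at most once, even under concurrent access.
static EMBEDDER: OnceLock<TextEmbedding> = OnceLock::new();

fn embedder() -> &'static TextEmbedding {
    EMBEDDER.get_or_init(|| {
        // The first call downloads the ONNX model if it is not cached;
        // later calls reuse the cached files and this in-memory instance.
        // BGESmallENV15 is illustrative; the ADR's BGE-M3 would be
        // selected the same way if the fastembed version provides it.
        TextEmbedding::try_new(
            InitOptions::new(EmbeddingModel::BGESmallENV15)
                .with_show_download_progress(true),
        )
        .expect("failed to initialise embedding model")
    })
}

fn embed_texts(texts: Vec<String>) -> Vec<Vec<f32>> {
    embedder()
        .embed(texts, None) // None => fastembed's default batch size
        .expect("embedding inference failed")
}
```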

Advantages:

  • Fully offline operation
  • No API keys or configuration
  • Fast local inference
  • Privacy preserved (no data leaves machine)
  • Consistent results (no API version drift)

Disadvantages:

  • Large binary size (ONNX runtime)
  • Initial model download required
  • CPU-only (no GPU acceleration in default build)
  • Memory overhead for model

Risk Assessment:

  • Technical Risk: Low. fastembed-rs is production-ready
  • Schedule Risk: Low. Drop-in integration
  • Ecosystem Risk: Low. ONNX is industry standard

Option 2: External Embedding API

Description: Call OpenAI, Cohere, or a similar API for embeddings.

Technical Characteristics:

  • HTTP client for API calls
  • API key management
  • Rate limiting and retries
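
For comparison, this rejected approach would look roughly like the sketch below, which calls OpenAI's public /v1/embeddings endpoint via reqwest and serde; the model name is illustrative, and retries and rate limiting are elided:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct EmbeddingResponse {
    data: Vec<EmbeddingData>,
}

#[derive(Deserialize)]
struct EmbeddingData {
    embedding: Vec<f32>,
}

fn embed_remote(text: &str, api_key: &str) -> Result<Vec<f32>, reqwest::Error> {
    // Requires reqwest with the "blocking" and "json" features enabled.
    let resp: EmbeddingResponse = reqwest::blocking::Client::new()
        .post("https://api.openai.com/v1/embeddings")
        .bearer_auth(api_key)
        .json(&serde_json::json!({
            "model": "text-embedding-3-small",
            "input": text,
        }))
        .send()?
        .error_for_status()?
        .json()?;

    // One input => one embedding row in the response.
    Ok(resp.data.into_iter().next().map(|d| d.embedding).unwrap_or_default())
}
```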

Advantages:

  • Smaller binary size
  • Access to latest models
  • GPU inference on server side

Disadvantages:

  • Requires network connectivity
  • API key management
  • Cost per token
  • Privacy concerns
  • Latency from network round-trips

Disqualifying Factor: Network dependency and privacy concerns conflict with offline-first CLI design.

Risk Assessment:

  • Technical Risk: Low. APIs are well-documented
  • Schedule Risk: Low. Simple HTTP client
  • Ecosystem Risk: Medium. API changes, pricing changes

Option 3: Local Model Server (Ollama, etc.)

Description: Require users to run a local embedding server.

Technical Characteristics:

  • HTTP client to localhost
  • External process management
  • Model management in server
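
This rejected option would amount to a localhost HTTP call, sketched below against an Ollama-style /api/embeddings endpoint; the route, payload, and model name are assumptions that differ across servers and versions:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct OllamaEmbeddingResponse {
    embedding: Vec<f32>,
}

fn embed_via_local_server(text: &str) -> Result<Vec<f32>, reqwest::Error> {
    // Assumes an Ollama server on its default port with an embedding
    // model already pulled; the user must install and run it separately.
    let resp: OllamaEmbeddingResponse = reqwest::blocking::Client::new()
        .post("http://localhost:11434/api/embeddings")
        .json(&serde_json::json!({
            "model": "nomic-embed-text",
            "prompt": text,
        }))
        .send()?
        .error_for_status()?
        .json()?;
    Ok(resp.embedding)
}
```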

Advantages:

  • Offloads inference to dedicated process
  • Potentially GPU acceleration
  • Model updates independent of rlm-rs

Disadvantages:

  • Additional installation step
  • Process management complexity
  • Port conflicts possible

Disqualifying Factor: Requiring a separate server conflicts with single-binary distribution goal.

Risk Assessment:

  • Technical Risk: Low. HTTP is simple
  • Schedule Risk: Medium. Documentation/setup guides needed
  • Ecosystem Risk: Medium. Server version compatibility

Decision

Embed the embedding model using fastembed-rs with the ONNX Runtime.

The implementation will use:

  • fastembed-rs for model management and inference
  • ONNX Runtime as the inference backend
  • Lazy loading to defer model download until first use
  • Thread-safe singleton for model instance sharing
  • Feature flag (fastembed-embeddings) to make embeddings optional
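
One hypothetical shape for that feature gate is compile-time dispatch via `#[cfg(feature = ...)]`. The `Embedder` trait and type names below are invented for illustration, with stubs standing in for the real modules (src/embedding/fastembed_impl.rs and src/embedding/fallback.rs):

```rust
/// Minimal embedding interface; rlm-rs's real trait is likely richer.
pub trait Embedder: Send + Sync {
    fn embed(&self, text: &str) -> Vec<f32>;
}

// Stubs standing in for the real implementations; names are illustrative.
pub struct FastEmbedder;
pub struct HashEmbedder;

impl Embedder for FastEmbedder {
    fn embed(&self, _text: &str) -> Vec<f32> {
        unimplemented!("ONNX-backed inference lives behind the feature flag")
    }
}

impl Embedder for HashEmbedder {
    fn embed(&self, _text: &str) -> Vec<f32> {
        vec![0.0; 384] // placeholder; see the hash-based sketch below
    }
}

// With the feature on, callers get the model-backed embedder; with it
// off, the ONNX runtime is not compiled in at all and the cheap
// fallback keeps semantic-search call sites building.
#[cfg(feature = "fastembed-embeddings")]
pub fn default_embedder() -> Box<dyn Embedder> {
    Box::new(FastEmbedder)
}

#[cfg(not(feature = "fastembed-embeddings"))]
pub fn default_embedder() -> Box<dyn Embedder> {
    Box::new(HashEmbedder)
}
```
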
Positive consequences:

  1. Offline operation: Works without network after the initial model download
  2. Privacy: No data leaves the user’s machine
  3. Zero configuration: No API keys or server setup required
  4. Consistent: Same model version produces reproducible results
  5. Fast: Local inference avoids network latency

Negative consequences:

  1. Binary size: ONNX runtime adds to binary size
  2. Initial download: First embedding operation downloads the model (~1.3GB for BGE-M3)
  3. Memory usage: Model loaded in memory during operation
  4. CPU-only: No GPU acceleration without a custom build

Neutral consequences:

  1. Feature flag: Embeddings can be disabled for smaller builds

Embedded embeddings via fastembed-rs enable rlm-rs to provide semantic search without external dependencies. The lazy loading pattern minimizes cold start impact for operations that don’t need embeddings.

Mitigations:

  • Lazy model loading to preserve cold start for non-embedding operations
  • Feature flag for builds that don’t need semantic search
  • Fallback embedder (hash-based) when the feature is disabled (see the sketch below)
  • Clear messaging during model download
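
A minimal sketch of a hash-based fallback, assuming a simple feature-hashing scheme (hash each token into a fixed-size vector, then L2-normalise); the actual src/embedding/fallback.rs implementation may differ:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const DIM: usize = 384;

fn hash_embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; DIM];
    for token in text.split_whitespace() {
        let mut h = DefaultHasher::new();
        token.to_lowercase().hash(&mut h);
        let x = h.finish();
        // Low bits pick the bucket; the top bit picks a sign, which
        // reduces systematic bias from hash collisions.
        let idx = (x as usize) % DIM;
        let sign = if x & (1 << 63) == 0 { 1.0 } else { -1.0 };
        v[idx] += sign;
    }
    // L2-normalise so dot products behave like cosine similarity.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}
```
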
Metadata:

  • Date: 2025-01-15
  • Source: v1.0.0 release design decisions
  • Related ADRs: ADR-001, ADR-008, ADR-010

Compliance Review

Status: Compliant

Findings:

| Finding | Files | Lines | Assessment |
| --- | --- | --- | --- |
| fastembed-rs integration | src/embedding/fastembed_impl.rs | all | compliant |
| Lazy loading singleton | src/embedding/fastembed_impl.rs | L14, L59-80 | compliant |
| Feature flag configured | Cargo.toml | L17-18 | compliant |
| Fallback embedder available | src/embedding/fallback.rs | all | compliant |

Summary: Embedded embedding model fully implemented with lazy loading and fallback.

Action Required: None