ADR-007: Embedded Embedding Model
Status
Accepted
Context
Background and Problem Statement
Semantic search requires converting text into embedding vectors that capture meaning, which in turn requires an embedding model. The question is whether to:
- Call an external API (OpenAI, Cohere, etc.)
- Run a local model server
- Embed the model directly in the binary
This decision affects offline capability, privacy, latency, and distribution complexity.
Current Limitations
- API dependencies: External APIs require network access and API keys, and incur per-token costs
- Server processes: Local servers add deployment complexity
- Privacy concerns: Sending text to external services may leak sensitive data
Decision Drivers
Primary Decision Drivers
- Offline capability: Must work without network connectivity
- Privacy: No data should leave the user’s machine
- Zero configuration: Should work out-of-the-box without API keys
Secondary Decision Drivers
- Latency: Local inference is faster than API calls
- Cost: No per-token charges
- Reliability: No dependency on external service availability
Considered Options
Option 1: Embedded Model via fastembed-rs
Description: Use fastembed-rs to run ONNX models directly within the Rust binary.
Technical Characteristics:
- ONNX Runtime for inference
- Model downloaded on first use
- Lazy loading to minimize cold start
- Thread-safe singleton pattern
Advantages:
- Fully offline operation
- No API keys or configuration
- Fast local inference
- Privacy preserved (no data leaves machine)
- Consistent results (no API version drift)
Disadvantages:
- Large binary size (ONNX runtime)
- Initial model download required
- CPU-only (no GPU acceleration in default build)
- Memory overhead for model
Risk Assessment:
- Technical Risk: Low. fastembed-rs is production-ready
- Schedule Risk: Low. Drop-in integration
- Ecosystem Risk: Low. ONNX is industry standard
Option 2: External Embedding API
Section titled “Option 2: External Embedding API”Description: Call OpenAI, Cohere, or similar API for embeddings.
Technical Characteristics:
- HTTP client for API calls
- API key management
- Rate limiting and retries
Advantages:
- Smaller binary size
- Access to latest models
- GPU inference on server side
Disadvantages:
- Requires network connectivity
- API key management
- Cost per token
- Privacy concerns
- Latency from network round-trips
Disqualifying Factor: Network dependency and privacy concerns conflict with offline-first CLI design.
Risk Assessment:
- Technical Risk: Low. APIs are well-documented
- Schedule Risk: Low. Simple HTTP client
- Ecosystem Risk: Medium. API changes, pricing changes
Option 3: Local Model Server (Ollama, etc.)
Description: Require users to run a local embedding server.
Technical Characteristics:
- HTTP client to localhost
- External process management
- Model management in server
Advantages:
- Offloads inference to dedicated process
- Potentially GPU acceleration
- Model updates independent of rlm-rs
Disadvantages:
- Additional installation step
- Process management complexity
- Port conflicts possible
Disqualifying Factor: Requiring a separate server conflicts with single-binary distribution goal.
Risk Assessment:
- Technical Risk: Low. HTTP is simple
- Schedule Risk: Medium. Documentation/setup guides needed
- Ecosystem Risk: Medium. Server version compatibility
Decision
Embed the embedding model using fastembed-rs with ONNX Runtime.
The implementation will use:
- fastembed-rs for model management and inference
- ONNX Runtime as the inference backend
- Lazy loading to defer model download until first use
- Thread-safe singleton for model instance sharing
- Feature flag (fastembed-embeddings) to make embeddings optional
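The lazy-loading, thread-safe singleton described above can be sketched with the standard library's OnceLock. This is a hypothetical sketch: the Embedder type here stands in for the actual fastembed-rs model wrapper, and the dimension and embed body are placeholders.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for the fastembed-rs model wrapper.
struct Embedder {
    dim: usize,
}

impl Embedder {
    // In the real implementation, this is where the ONNX model would be
    // downloaded (on first use) and loaded into memory.
    fn load() -> Self {
        Embedder { dim: 384 }
    }

    fn embed(&self, text: &str) -> Vec<f32> {
        // Placeholder: real inference runs through ONNX Runtime.
        vec![text.len() as f32; self.dim]
    }
}

static EMBEDDER: OnceLock<Embedder> = OnceLock::new();

// All threads share one model instance; load() runs at most once,
// and only when an embedding is first requested.
fn embedder() -> &'static Embedder {
    EMBEDDER.get_or_init(Embedder::load)
}

fn main() {
    let v = embedder().embed("semantic search");
    assert_eq!(v.len(), 384);
}
```

Because get_or_init defers load() until the first call, commands that never touch semantic search pay no model-loading cost, which is the cold-start property the design aims for.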
Consequences
Positive
- Offline operation: Works without network after the initial model download
- Privacy: No data leaves the user’s machine
- Zero configuration: No API keys or server setup required
- Consistent: Same model version produces reproducible results
- Fast: Local inference avoids network latency
Negative
- Binary size: Bundling ONNX Runtime increases the final binary size
- Initial download: First embedding operation downloads the model (~1.3GB for BGE-M3)
- Memory usage: Model loaded in memory during operation
- CPU-only: No GPU acceleration without custom build
Neutral
- Feature flag: Embeddings can be disabled for smaller builds
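A feature flag of this shape could be wired up in Cargo.toml roughly as follows. This is a hypothetical sketch: the dep: syntax and version are assumptions, not a copy of the actual manifest.

```toml
[features]
# Enabled by default; build with --no-default-features for smaller binaries.
default = ["fastembed-embeddings"]
# Pulls in the fastembed crate (and with it ONNX Runtime) only when enabled.
fastembed-embeddings = ["dep:fastembed"]

[dependencies]
fastembed = { version = "4", optional = true }
```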
Decision Outcome
Embedded embeddings via fastembed-rs enable rlm-rs to provide semantic search without external dependencies. The lazy-loading pattern minimizes cold-start impact for operations that don’t need embeddings.
Mitigations:
- Lazy model loading to preserve cold start for non-embedding operations
- Feature flag for builds that don’t need semantic search
- Fallback embedder (hash-based) when feature is disabled
- Clear messaging during model download
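The hash-based fallback embedder could look something like the sketch below. The function name, dimension handling, and value mapping are hypothetical; the point is only that a deterministic, dependency-free embedding keeps the API usable (if not semantically meaningful) when the feature is disabled.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical fallback: derive each vector component from a hash of
// (text, component index), mapped into [-1.0, 1.0]. Deterministic for a
// given input, but carries no semantic meaning.
fn hash_embed(text: &str, dim: usize) -> Vec<f32> {
    (0..dim)
        .map(|i| {
            let mut h = DefaultHasher::new();
            (text, i).hash(&mut h);
            // Scale the u64 hash into the range [-1.0, 1.0).
            (h.finish() as f64 / u64::MAX as f64 * 2.0 - 1.0) as f32
        })
        .collect()
}

fn main() {
    let v = hash_embed("semantic search", 8);
    assert_eq!(v.len(), 8);
    assert!(v.iter().all(|x| (-1.0..=1.0).contains(x)));
}
```

A fallback like this preserves vector shapes and determinism for callers, so code paths built against the embedding interface keep working in feature-disabled builds.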
Related Decisions
- ADR-001: Adopt RLM Pattern - Requires semantic embeddings
- ADR-008: Hybrid Search - Uses embeddings for semantic component
- ADR-010: Switch to BGE-M3 - Current model choice
- fastembed-rs - Rust embedding library
- ONNX Runtime - Inference engine
More Information
- Date: 2025-01-15
- Source: v1.0.0 release design decisions
- Related ADRs: ADR-001, ADR-008, ADR-010
2025-01-20
Status: Compliant
Findings:
| Finding | Files | Lines | Assessment |
|---|---|---|---|
| fastembed-rs integration | src/embedding/fastembed_impl.rs | all | compliant |
| Lazy loading singleton | src/embedding/fastembed_impl.rs | L14, L59-80 | compliant |
| Feature flag configured | Cargo.toml | L17-18 | compliant |
| Fallback embedder available | src/embedding/fallback.rs | all | compliant |
Summary: Embedded embedding model fully implemented with lazy loading and fallback.
Action Required: None