ADR-012: Concurrency Model with Rayon
ADR-012: Concurrency Model with Rayon
Section titled “ADR-012: Concurrency Model with Rayon”Status
Section titled “Status”Accepted
Context
Section titled “Context”Background and Problem Statement
Section titled “Background and Problem Statement”Processing large documents requires efficient parallel execution. When loading multi-megabyte files and generating embeddings for hundreds of chunks, single-threaded execution becomes a bottleneck. The system needs a concurrency model that:
- Scales with available CPU cores
- Maintains thread safety without complex synchronization
- Integrates cleanly with existing sequential code
- Handles errors gracefully in parallel contexts
Performance Requirements
Section titled “Performance Requirements”- Load and chunk 10MB+ files efficiently
- Generate embeddings for 100+ chunks concurrently
- Minimize latency for interactive CLI usage
- Scale from single-core to many-core systems
Decision Drivers
Section titled “Decision Drivers”Primary Decision Drivers
Section titled “Primary Decision Drivers”- Work-stealing efficiency: Automatic load balancing across threads
- Simple API: Parallel iterators mirror sequential code
- Safety guarantees: Compile-time verification of data race freedom
- Error propagation: Clean handling of failures in parallel contexts
Secondary Decision Drivers
Section titled “Secondary Decision Drivers”- Zero-overhead abstraction: No runtime cost for unused parallelism
- Rust ecosystem: Well-maintained, widely-used crate
- Configurable: Thread pool sizing and scheduling options
Considered Options
Section titled “Considered Options”Option 1: Rayon (Chosen)
Section titled “Option 1: Rayon (Chosen)”Data-parallelism library using work-stealing for automatic load balancing.
Pros:
- Drop-in parallel iterators (
.par_iter()) - Automatic work distribution
- Scoped threads prevent dangling references
- Excellent documentation and ecosystem support
Cons:
- Adds dependency (~100KB)
- Thread pool overhead for small workloads
Option 2: std::thread
Section titled “Option 2: std::thread”Rust standard library threading primitives.
Pros:
- No external dependencies
- Full control over thread lifecycle
Cons:
- Manual thread management
- No work-stealing
- Complex error handling across threads
Option 3: tokio
Section titled “Option 3: tokio”Async runtime for I/O-bound workloads.
Pros:
- Excellent for network operations
- Mature ecosystem
Cons:
- Async/await complexity for CPU-bound work
- Overhead for non-I/O operations
- Viral async requirement
Option 4: crossbeam
Section titled “Option 4: crossbeam”Low-level concurrency primitives.
Pros:
- Fine-grained control
- Scoped threads
Cons:
- More boilerplate than Rayon
- Manual work distribution
Decision
Section titled “Decision”We chose Rayon for parallel processing with the following patterns:
Parallel Semantic Search
Section titled “Parallel Semantic Search”semantic_search() uses par_iter() to distribute the O(n) cosine-similarity scan across
available CPU cores. This provides near-linear speedup for collections of >1 000 embeddings
without changing the public API surface:
let mut similarities: Vec<(i64, f32)> = all_embeddings .par_iter() .map(|(chunk_id, embedding)| { let sim = cosine_similarity(&query_embedding, embedding); (*chunk_id, sim) }) .filter(|(_, sim)| *sim >= config.similarity_threshold) .collect();
// Sort by similarity descendingsimilarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
// Limit resultssimilarities.truncate(config.top_k * 2);The rayon import is scoped to the function body (use rayon::prelude::*;) to keep the
module-level namespace clean.
Parallel Chunking
Section titled “Parallel Chunking”The ParallelChunker uses Rayon for concurrent chunk processing:
pub struct ParallelChunker { inner: Box<dyn Chunker>, thread_count: Option<usize>,}
impl ParallelChunker { pub fn chunk(&self, buffer_id: i64, content: &str, path: Option<&Path>) -> Result<Vec<Chunk>> { // Split content into segments let segments = self.split_into_segments(content);
// Process segments in parallel let results: Result<Vec<Vec<Chunk>>> = segments .par_iter() .map(|segment| self.inner.chunk(buffer_id, segment, path)) .collect();
// Merge and reindex Ok(self.merge_chunks(results?)) }}Thread Safety for Storage
Section titled “Thread Safety for Storage”SQLite connections are not thread-safe. We use Mutex wrapping:
pub struct SqliteStorage { conn: Mutex<Connection>, db_path: PathBuf,}This ensures only one thread accesses the database at a time while allowing parallel chunk processing.
Parallel Embedding Generation
Section titled “Parallel Embedding Generation”Embedding batches can be processed in parallel when chunks don’t share state:
let embeddings: Vec<Embedding> = chunks .par_chunks(BATCH_SIZE) .flat_map(|batch| embedder.embed_batch(batch)) .collect();Consequences
Section titled “Consequences”Positive Consequences
Section titled “Positive Consequences”- Linear speedup: Processing time scales with core count
- Simple code: Parallel iterators look like sequential code
- Safe by default: Compile-time data race prevention
- Automatic optimization: Work-stealing balances load
Negative Consequences
Section titled “Negative Consequences”- Memory overhead: Each thread needs stack space
- Debugging complexity: Parallel execution harder to trace
- SQLite serialization: Database becomes bottleneck for write-heavy workloads
Neutral Consequences
Section titled “Neutral Consequences”- Thread pool startup: One-time initialization cost
- Batch size tuning: Optimal batch size depends on workload
Implementation Notes
Section titled “Implementation Notes”Configuring Thread Count
Section titled “Configuring Thread Count”// Use custom thread pool for specific operationsrayon::ThreadPoolBuilder::new() .num_threads(4) .build_global() .unwrap();Error Handling in Parallel Contexts
Section titled “Error Handling in Parallel Contexts”Rayon’s collect::<Result<Vec<_>>>() short-circuits on first error:
let results: Result<Vec<Chunk>> = segments .par_iter() .map(|s| process(s)) // Returns Result<Chunk> .collect(); // Stops on first ErrWhen NOT to Use Parallelism
Section titled “When NOT to Use Parallelism”- Small workloads (< 10 chunks): Overhead exceeds benefit
- Sequential dependencies: Operations that must be ordered
- I/O-bound work: Consider async instead
Performance Characteristics
Section titled “Performance Characteristics”| Operation | Sequential | Parallel (8 cores) | Speedup |
|---|---|---|---|
| Chunk 1MB | 50ms | 15ms | 3.3x |
| Embed 100 chunks | 2000ms | 300ms | 6.7x |
| Semantic search 1000 chunks | 100ms | 20ms | 5x |
Measurements on Apple M1 Pro, representative workloads