ADR-012: Concurrency Model with Rayon

Status: Accepted

Context

Processing large documents requires efficient parallel execution. When loading multi-megabyte files and generating embeddings for hundreds of chunks, single-threaded execution becomes a bottleneck. The system needs a concurrency model that:

  1. Scales with available CPU cores
  2. Maintains thread safety without complex synchronization
  3. Integrates cleanly with existing sequential code
  4. Handles errors gracefully in parallel contexts

Key workloads:

  • Load and chunk 10MB+ files efficiently
  • Generate embeddings for 100+ chunks concurrently
  • Minimize latency for interactive CLI usage
  • Scale from single-core to many-core systems

Evaluation criteria:

  1. Work-stealing efficiency: Automatic load balancing across threads
  2. Simple API: Parallel iterators mirror sequential code
  3. Safety guarantees: Compile-time verification of data race freedom
  4. Error propagation: Clean handling of failures in parallel contexts
  5. Zero-overhead abstraction: No runtime cost for unused parallelism
  6. Rust ecosystem: Well-maintained, widely-used crate
  7. Configurable: Thread pool sizing and scheduling options

Option 1: Rayon

A data-parallelism library using work-stealing for automatic load balancing.

Pros:

  • Drop-in parallel iterators (.par_iter())
  • Automatic work distribution
  • Scoped threads prevent dangling references
  • Excellent documentation and ecosystem support

Cons:

  • Adds dependency (~100KB)
  • Thread pool overhead for small workloads

Option 2: std::thread

Rust standard library threading primitives.

Pros:

  • No external dependencies
  • Full control over thread lifecycle

Cons:

  • Manual thread management
  • No work-stealing
  • Complex error handling across threads

Option 3: Async runtime

An async runtime suited to I/O-bound workloads.

Pros:

  • Excellent for network operations
  • Mature ecosystem

Cons:

  • Async/await complexity for CPU-bound work
  • Overhead for non-I/O operations
  • Viral async requirement

Option 4: Low-level concurrency primitives

Pros:

  • Fine-grained control
  • Scoped threads

Cons:

  • More boilerplate than Rayon
  • Manual work distribution

Decision

We chose Rayon for parallel processing, applied through the following patterns:

semantic_search() uses par_iter() to distribute the O(n) cosine-similarity scan across available CPU cores. This provides near-linear speedup for collections of more than 1,000 embeddings without changing the public API surface:

```rust
let mut similarities: Vec<(i64, f32)> = all_embeddings
    .par_iter()
    .map(|(chunk_id, embedding)| {
        let sim = cosine_similarity(&query_embedding, embedding);
        (*chunk_id, sim)
    })
    .filter(|(_, sim)| *sim >= config.similarity_threshold)
    .collect();

// Sort by similarity descending
similarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));

// Limit results
similarities.truncate(config.top_k * 2);
```

The rayon import is scoped to the function body (use rayon::prelude::*;) to keep the module-level namespace clean.
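For reference, the cosine_similarity helper used above can be sketched with the standard library alone. This is a hypothetical minimal version, not necessarily the project's implementation:

```rust
// Hypothetical sketch of cosine_similarity: dot product over the product of
// magnitudes, returning 0.0 for zero-length vectors to avoid division by zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```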

The ParallelChunker uses Rayon for concurrent chunk processing:

```rust
pub struct ParallelChunker {
    // Send + Sync bounds let the trait object be shared across Rayon threads
    inner: Box<dyn Chunker + Send + Sync>,
    thread_count: Option<usize>,
}

impl ParallelChunker {
    pub fn chunk(&self, buffer_id: i64, content: &str, path: Option<&Path>) -> Result<Vec<Chunk>> {
        // Split content into segments
        let segments = self.split_into_segments(content);

        // Process segments in parallel
        let results: Result<Vec<Vec<Chunk>>> = segments
            .par_iter()
            .map(|segment| self.inner.chunk(buffer_id, segment, path))
            .collect();

        // Merge and reindex
        Ok(self.merge_chunks(results?))
    }
}
```
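split_into_segments and merge_chunks are internal helpers not shown above. A hypothetical sketch of the splitting step, assuming blank-line (paragraph) boundaries — the real boundary logic may differ:

```rust
// Hypothetical sketch: split on blank lines so each paragraph can be
// chunked independently by a Rayon worker.
fn split_into_segments(content: &str) -> Vec<String> {
    content
        .split("\n\n")
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}
```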

SQLite connections are not thread-safe. We use Mutex wrapping:

```rust
pub struct SqliteStorage {
    conn: Mutex<Connection>,
    db_path: PathBuf,
}
```

This ensures only one thread accesses the database at a time while allowing parallel chunk processing.
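The locking pattern can be shown self-contained with a placeholder standing in for the real SQLite connection type (a sketch; Connection and record_write here are illustrative stand-ins, not the driver's API):

```rust
use std::sync::Mutex;

// Stand-in for the real SQLite connection type.
struct Connection {
    writes: u32,
}

struct SqliteStorage {
    conn: Mutex<Connection>,
}

impl SqliteStorage {
    fn record_write(&self) {
        // The lock is held only for the duration of the database call, so
        // parallel chunking threads queue briefly instead of blocking long.
        let mut conn = self.conn.lock().expect("lock poisoned");
        conn.writes += 1;
    }
}
```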

Embedding batches can be processed in parallel when chunks don’t share state:

```rust
let embeddings: Vec<Embedding> = chunks
    .par_chunks(BATCH_SIZE)
    .flat_map(|batch| embedder.embed_batch(batch))
    .collect();
```

Consequences

Positive:

  1. Near-linear speedup: Processing time scales with core count
  2. Simple code: Parallel iterators look like sequential code
  3. Safe by default: Compile-time data race prevention
  4. Automatic optimization: Work-stealing balances load

Negative:

  1. Memory overhead: Each thread needs stack space
  2. Debugging complexity: Parallel execution harder to trace
  3. SQLite serialization: Database becomes bottleneck for write-heavy workloads

Neutral:

  1. Thread pool startup: One-time initialization cost
  2. Batch size tuning: Optimal batch size depends on workload

The global thread pool can be sized explicitly (build_global() may only be called once, before Rayon is first used):

```rust
// Use custom thread pool for specific operations
rayon::ThreadPoolBuilder::new()
    .num_threads(4)
    .build_global()
    .unwrap();
```

Rayon’s collect::<Result<Vec<_>>>() short-circuits on first error:

```rust
let results: Result<Vec<Chunk>> = segments
    .par_iter()
    .map(|s| process(s)) // Returns Result<Chunk>
    .collect(); // Stops on first Err
```
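This short-circuiting comes from the same FromIterator impl for Result that std's sequential iterators use, so it can be demonstrated without Rayon; swapping .iter() for .par_iter() preserves the semantics:

```rust
// Collecting an iterator of Results into Result<Vec<_>, _> stops at the
// first Err and returns it; Rayon's ParallelIterator mirrors this behaviour.
fn parse_all(inputs: &[&str]) -> Result<Vec<i32>, std::num::ParseIntError> {
    inputs.iter().map(|s| s.parse::<i32>()).collect()
}
```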

When not to parallelize:

  • Small workloads (< 10 chunks): Overhead exceeds benefit
  • Sequential dependencies: Operations that must be ordered
  • I/O-bound work: Consider async instead

| Operation | Sequential | Parallel (8 cores) | Speedup |
|---|---|---|---|
| Chunk 1MB | 50ms | 15ms | 3.3x |
| Embed 100 chunks | 2000ms | 300ms | 6.7x |
| Semantic search, 1000 chunks | 100ms | 20ms | 5x |

Measurements taken on an Apple M1 Pro with representative workloads.
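Figures like these can be reproduced with a minimal std-only harness. The helper below is a sketch; time_it and the workload closure are illustrative, not part of the codebase:

```rust
use std::time::{Duration, Instant};

// Minimal timing helper: runs the closure once and returns wall-clock time.
// Real benchmarking should add warm-up runs and multiple samples.
fn time_it<F: FnMut()>(mut f: F) -> Duration {
    let start = Instant::now();
    f();
    start.elapsed()
}
```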