ADR-012: Concurrency Model with Rayon

Status: Accepted

Context

Processing large documents requires efficient parallel execution. When loading multi-megabyte files and generating embeddings for hundreds of chunks, single-threaded execution becomes a bottleneck. The system needs a concurrency model that:

  1. Scales with available CPU cores
  2. Maintains thread safety without complex synchronization
  3. Integrates cleanly with existing sequential code
  4. Handles errors gracefully in parallel contexts

Key workloads:

  • Load and chunk 10MB+ files efficiently
  • Generate embeddings for 100+ chunks concurrently
  • Minimize latency for interactive CLI usage
  • Scale from single-core to many-core systems

Evaluation criteria:

  1. Work-stealing efficiency: Automatic load balancing across threads
  2. Simple API: Parallel iterators mirror sequential code
  3. Safety guarantees: Compile-time verification of data race freedom
  4. Error propagation: Clean handling of failures in parallel contexts
  5. Zero-overhead abstraction: No runtime cost for unused parallelism
  6. Rust ecosystem: Well-maintained, widely-used crate
  7. Configurable: Thread pool sizing and scheduling options

Option 1: Rayon

A data-parallelism library using work-stealing for automatic load balancing.

Pros:

  • Drop-in parallel iterators (.par_iter())
  • Automatic work distribution
  • Scoped threads prevent dangling references
  • Excellent documentation and ecosystem support

Cons:

  • Adds dependency (~100KB)
  • Thread pool overhead for small workloads

Option 2: std::thread

Rust standard library threading primitives.

Pros:

  • No external dependencies
  • Full control over thread lifecycle

Cons:

  • Manual thread management
  • No work-stealing
  • Complex error handling across threads

Option 3: Async runtime

An async runtime suited to I/O-bound workloads.

Pros:

  • Excellent for network operations
  • Mature ecosystem

Cons:

  • Async/await complexity for CPU-bound work
  • Overhead for non-I/O operations
  • Viral async requirement

Option 4: Low-level concurrency primitives

Pros:

  • Fine-grained control
  • Scoped threads

Cons:

  • More boilerplate than Rayon
  • Manual work distribution

Decision

We chose Rayon for parallel processing, applied through the following patterns:

semantic_search() uses par_iter() to distribute the O(n) cosine-similarity scan across available CPU cores. This provides near-linear speedup for collections of more than 1,000 embeddings without changing the public API surface:

```rust
let mut similarities: Vec<(i64, f32)> = all_embeddings
    .par_iter()
    .map(|(chunk_id, embedding)| {
        let sim = cosine_similarity(&query_embedding, embedding);
        (*chunk_id, sim)
    })
    .filter(|(_, sim)| *sim >= config.similarity_threshold)
    .collect();

// Sort by similarity descending
similarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));

// Limit results
similarities.truncate(config.top_k * 2);
```

The rayon import is scoped to the function body (use rayon::prelude::*;) to keep the module-level namespace clean.
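For reference, the cosine_similarity helper used above can be sketched with the standard library alone. This is a hypothetical minimal version, not necessarily the project's implementation:

```rust
// Hypothetical sketch of cosine_similarity: dot product over the product of
// magnitudes, returning 0.0 for zero-length vectors to avoid division by zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```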

The ParallelChunker uses Rayon for concurrent chunk processing:

```rust
pub struct ParallelChunker {
    // Send + Sync bounds let the trait object be shared across Rayon threads
    inner: Box<dyn Chunker + Send + Sync>,
    thread_count: Option<usize>,
}

impl ParallelChunker {
    pub fn chunk(&self, buffer_id: i64, content: &str, path: Option<&Path>) -> Result<Vec<Chunk>> {
        // Split content into segments
        let segments = self.split_into_segments(content);

        // Process segments in parallel
        let results: Result<Vec<Vec<Chunk>>> = segments
            .par_iter()
            .map(|segment| self.inner.chunk(buffer_id, segment, path))
            .collect();

        // Merge and reindex
        Ok(self.merge_chunks(results?))
    }
}
```
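split_into_segments and merge_chunks are internal helpers not shown above. A hypothetical sketch of the splitting step, assuming blank-line (paragraph) boundaries — the real boundary logic may differ:

```rust
// Hypothetical sketch: split on blank lines so each paragraph can be
// chunked independently by a Rayon worker.
fn split_into_segments(content: &str) -> Vec<String> {
    content
        .split("\n\n")
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}
```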

SQLite connections are not thread-safe. We use Mutex wrapping:

```rust
pub struct SqliteStorage {
    conn: Mutex<Connection>,
    db_path: PathBuf,
}
```

This ensures only one thread accesses the database at a time while allowing parallel chunk processing.
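The locking pattern can be shown self-contained with a placeholder standing in for the real SQLite connection type (a sketch; Connection and record_write here are illustrative stand-ins, not the driver's API):

```rust
use std::sync::Mutex;

// Stand-in for the real SQLite connection type.
struct Connection {
    writes: u32,
}

struct SqliteStorage {
    conn: Mutex<Connection>,
}

impl SqliteStorage {
    fn record_write(&self) {
        // The lock is held only for the duration of the database call, so
        // parallel chunking threads queue briefly instead of blocking long.
        let mut conn = self.conn.lock().expect("lock poisoned");
        conn.writes += 1;
    }
}
```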

Embedding batches can be processed in parallel when chunks don’t share state:

```rust
let embeddings: Vec<Embedding> = chunks
    .par_chunks(BATCH_SIZE)
    .flat_map(|batch| embedder.embed_batch(batch))
    .collect();
```

Consequences

Positive:

  1. Near-linear speedup: Processing time scales with core count
  2. Simple code: Parallel iterators look like sequential code
  3. Safe by default: Compile-time data race prevention
  4. Automatic optimization: Work-stealing balances load

Negative:

  1. Memory overhead: Each thread needs stack space
  2. Debugging complexity: Parallel execution harder to trace
  3. SQLite serialization: Database becomes bottleneck for write-heavy workloads

Neutral:

  1. Thread pool startup: One-time initialization cost
  2. Batch size tuning: Optimal batch size depends on workload

The global thread pool can be sized explicitly (build_global() may only be called once, before Rayon is first used):

```rust
// Use custom thread pool for specific operations
rayon::ThreadPoolBuilder::new()
    .num_threads(4)
    .build_global()
    .unwrap();
```

Rayon’s collect::<Result<Vec<_>>>() short-circuits on first error:

```rust
let results: Result<Vec<Chunk>> = segments
    .par_iter()
    .map(|s| process(s)) // Returns Result<Chunk>
    .collect(); // Stops on first Err
```
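This short-circuiting comes from the same FromIterator impl for Result that std's sequential iterators use, so it can be demonstrated without Rayon; swapping .iter() for .par_iter() preserves the semantics:

```rust
// Collecting an iterator of Results into Result<Vec<_>, _> stops at the
// first Err and returns it; Rayon's ParallelIterator mirrors this behaviour.
fn parse_all(inputs: &[&str]) -> Result<Vec<i32>, std::num::ParseIntError> {
    inputs.iter().map(|s| s.parse::<i32>()).collect()
}
```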

When not to parallelize:

  • Small workloads (< 10 chunks): Overhead exceeds benefit
  • Sequential dependencies: Operations that must be ordered
  • I/O-bound work: Consider async instead

| Operation | Sequential | Parallel (8 cores) | Speedup |
|---|---|---|---|
| Chunk 1MB | 50ms | 15ms | 3.3x |
| Embed 100 chunks | 2000ms | 300ms | 6.7x |
| Semantic search, 1000 chunks | 100ms | 20ms | 5x |

Measurements taken on an Apple M1 Pro with representative workloads.
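Figures like these can be reproduced with a minimal std-only harness. The helper below is a sketch; time_it and the workload closure are illustrative, not part of the codebase:

```rust
use std::time::{Duration, Instant};

// Minimal timing helper: runs the closure once and returns wall-clock time.
// Real benchmarking should add warm-up runs and multiple samples.
fn time_it<F: FnMut()>(mut f: F) -> Duration {
    let start = Instant::now();
    f();
    start.elapsed()
}
```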