Benchmark Regression

Automated performance benchmarking with Criterion.rs on every pull request, so regressions can be spotted in review before they merge.

Reference

Field	Value
Workflow	`.github/workflows/benchmark-regression.yml`
Tool	Criterion.rs
Triggers	Pull requests + manual (`workflow_dispatch`)
Goal	Prevent performance regressions

CI pipeline stages

The workflow runs these stages automatically:

Restore — restore a cached baseline (target/criterion) from a prior run, if one exists.
Run — run the benchmarks under benches/ (if present) on the PR branch.
Report — generate benchmark-report.md from the run output.
Upload — upload the benchmark-results artifact (report, target/criterion/, raw output; 90-day retention).

Benchmark output

expensive_function      time:   [48.123 µs 48.567 µs 49.012 µs]
                        change: [-5.2341% -3.1234% -1.0123%] (p = 0.02 < 0.05)
                        Performance has improved.

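time: confidence interval for the time per iteration (lower bound, best estimate, upper bound).
change: percent change from the baseline, reported as the same kind of interval (lower bound, estimate, upper bound).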
p-value: statistical significance (< 0.05 = significant).
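When reviewing raw output by script rather than by eye, the change: line can be pulled apart with a few string operations. The sketch below is illustrative only: parse_change_line and the Change struct are hypothetical names, not part of Criterion, and the parser assumes the exact line layout shown in the sample output above.

```rust
/// Percent-change interval and p-value read from a Criterion `change:` line.
/// (Hypothetical helper type; Criterion does not export it.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Change {
    pub lower: f64,
    pub estimate: f64,
    pub upper: f64,
    pub p_value: f64,
}

/// Parses a line such as
/// `change: [-5.2341% -3.1234% -1.0123%] (p = 0.02 < 0.05)`.
/// Returns `None` for any line that does not match that layout.
pub fn parse_change_line(line: &str) -> Option<Change> {
    let rest = line.trim().strip_prefix("change:")?;
    let (_, after_open) = rest.split_once('[')?;
    let (interval, tail) = after_open.split_once(']')?;

    // The three bracketed values: lower bound, estimate, upper bound.
    let mut bounds = interval
        .split_whitespace()
        .map(|tok| tok.trim_end_matches('%').parse::<f64>().ok());
    let lower = bounds.next()??;
    let estimate = bounds.next()??;
    let upper = bounds.next()??;

    // The p-value is the first token after "p = ".
    let (_, after_p) = tail.split_once("p = ")?;
    let p_value = after_p
        .split_whitespace()
        .next()?
        .trim_end_matches(')')
        .parse::<f64>()
        .ok()?;

    Some(Change { lower, estimate, upper, p_value })
}
```

Lines that are not change: lines (for example the time: line) yield None, so the function can be mapped over every line of a captured run.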

Performance change thresholds

Change	Interpretation	Action
`< -5%`	Improvement ✅	Great — document what improved
`-5%` to `+5%`	No change ⚪	Within noise threshold
`+5%` to `+20%`	Minor regression ⚠️	Investigate if acceptable
`> +20%`	Major regression ❌	Must fix before merge
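The table can be expressed as a small classification function. This is a sketch, not something the workflow runs: Verdict and classify_change are hypothetical names, and the handling of the exact boundary values (5% counts as no change, 20% counts as a minor regression) is an assumption, since the table does not say which tier owns them.

```rust
/// Review outcome for a percent change, following the thresholds table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Improvement,
    NoChange,
    MinorRegression,
    MajorRegression,
}

/// Maps a percent change (positive = slower) to a verdict.
/// Boundary handling is an assumption: exactly ±5% is treated as noise
/// and exactly +20% as a minor regression.
pub fn classify_change(percent: f64) -> Verdict {
    if percent < -5.0 {
        Verdict::Improvement
    } else if percent <= 5.0 {
        Verdict::NoChange
    } else if percent <= 20.0 {
        Verdict::MinorRegression
    } else {
        Verdict::MajorRegression
    }
}
```

For example, a change of +14.7% falls in the minor regression tier, and +0.9% is treated as noise.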

Statistical significance

change: [+2.5% +5.2% +7.8%] (p = 0.45 > 0.05)
No change in performance detected.

Not significant — variation likely due to noise, not a real change.

change: [+15.2% +18.5% +21.3%] (p = 0.001 < 0.05)
Performance has regressed.

Significant — real performance degradation detected.

Regression detection criteria

Not automated. The workflow runs the benchmarks and uploads the results, but it does not gate the PR on a regression — its compare step always reports no regression. Use these criteria when reviewing the uploaded artifact (or comparing locally with cargo bench -- --baseline …) to decide whether a change is a real regression:

Statistical significance (p < 0.05).
Magnitude (> 5% slower).
Consistency (the center estimate of the change interval sits in a regression range).
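A reviewer applying these three checks by hand is doing roughly the following. This is a sketch under stated assumptions, not the workflow's logic (the workflow has none): ChangeInterval and is_regression are hypothetical names, magnitude is read as the center estimate exceeding +5%, and consistency is read, more strictly than the text requires, as the whole interval lying on the slow side of zero.

```rust
/// The three bracketed values and the p-value from a `change:` line,
/// in percent (positive = slower). Hypothetical type for illustration.
#[derive(Debug, Clone, Copy)]
pub struct ChangeInterval {
    pub lower: f64,
    pub estimate: f64,
    pub upper: f64,
    pub p_value: f64,
}

/// Returns true only when all three review criteria hold.
pub fn is_regression(change: &ChangeInterval) -> bool {
    // 1. Statistical significance.
    let significant = change.p_value < 0.05;
    // 2. Magnitude: center estimate more than 5% slower.
    let large_enough = change.estimate > 5.0;
    // 3. Consistency (assumed reading): even the lower bound is slower.
    let consistent = change.lower > 0.0;

    significant && large_enough && consistent
}
```

Applied to the two samples above, the interval [+15.2% +18.5% +21.3%] with p = 0.001 passes all three checks, while [+2.5% +5.2% +7.8%] with p = 0.45 fails on significance alone.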

Benchmark report artifact

The workflow writes a benchmark-report.md into the benchmark-results artifact (it does not post a PR comment). The report wraps the tail of the raw cargo bench output; for example:

# Benchmark Results

## Performance Summary

| Benchmark | Baseline | Current | Change |
|-----------|----------|---------|--------|
| parse_small | 1.23 µs | 1.20 µs | -2.4% ✅ |
| parse_large | 45.6 µs | 52.3 µs | +14.7% ⚠️ |
| compute | 234 ns | 236 ns | +0.9% ⚪ |

## Regressions Detected ⚠️

**parse_large**: 14.7% slower (p < 0.01)
- Review recent changes to parsing logic
- Consider optimization or accept tradeoff

Criterion HTML reports

Criterion generates HTML reports under target/criterion/:

target/criterion/
├── expensive_function/
│   ├── report/
│   │   ├── index.html
│   │   └── violin.svg
│   └── base/
└── report/
    └── index.html

Open target/criterion/report/index.html to view them.

How-to

Set up benchmarks

Create a benches/ directory at the crate root:

benches/
├── my_benchmark.rs
└── another_benchmark.rs

Write a benchmark file, benches/performance.rs:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
// Assumes the crate exports both functions; `process` is used below.
use rust_template::{expensive_function, process};

fn benchmark_expensive_function(c: &mut Criterion) {
    c.bench_function("expensive_function", |b| {
        b.iter(|| expensive_function(black_box(100)))
    });
}

fn benchmark_with_setup(c: &mut Criterion) {
    c.bench_function("with_setup", |b| {
        let data = vec![1, 2, 3, 4, 5];
        b.iter(|| {
            process(black_box(&data))
        })
    });
}

criterion_group!(benches, benchmark_expensive_function, benchmark_with_setup);
criterion_main!(benches);

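Register the benchmark in Cargo.toml (Criterion must also be listed under [dev-dependencies]). Setting harness = false disables the default test harness so Criterion can supply its own main:

[[bench]]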
name = "performance"
harness = false

Verify: cargo bench --bench performance.
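The benchmark file above imports expensive_function from the rust_template crate and calls process, but this guide does not define either one. To try the setup on an empty crate, stand-ins along these lines could go in src/lib.rs. Both bodies are placeholders invented for illustration; substitute the code you actually want to measure.

```rust
/// Placeholder workload: sum of squares from 1 to `n`.
/// Intended for small `n`; it does not guard against overflow.
pub fn expensive_function(n: u64) -> u64 {
    (1..=n).map(|i| i * i).sum()
}

/// Placeholder workload: sums a slice of integers.
pub fn process(data: &[i32]) -> i32 {
    data.iter().sum()
}
```

With these in place, black_box(100) in the benchmark is inferred as a u64 and black_box(&data) as a reference to the vector, which coerces to a slice.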

Run benchmarks locally

# Run all benchmarks
cargo bench

# Run a specific benchmark
cargo bench --bench performance

# Save a baseline
cargo bench -- --save-baseline main

# Compare against a baseline
cargo bench -- --baseline main

Verify: confirm the run prints time: and change: lines.

Configure Criterion

Tune sampling in benches/benchmark.rs:

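use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion, SamplingMode};

fn custom_criterion() -> Criterion {
    Criterion::default()
        .sample_size(100)              // Default is 100; more samples reduce noise but take longer
        .warm_up_time(Duration::from_secs(3))
        .measurement_time(Duration::from_secs(5))
        .noise_threshold(0.05)         // Treat changes under 5% as noise
}

fn my_benchmark(c: &mut Criterion) {
    // Sampling mode is configured per benchmark group, not on Criterion
    let mut group = c.benchmark_group("my_group");
    group.sampling_mode(SamplingMode::Flat);
    // group.bench_function(...) calls go here
    group.finish();
}

criterion_group! {
    name = benches;
    config = custom_criterion();
    targets = my_benchmark
}
criterion_main!(benches);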

Adjust the CI run duration in the workflow:

inputs:
  duration:
    default: '300'  # 5 minutes

Verify: run cargo bench and check the warm-up and measurement times it prints.

Investigate a regression

Profile to find the hot path:

cargo install flamegraph
cargo flamegraph --bench performance

perf record --call-graph dwarf cargo bench
perf report

Choose an outcome:
- Fix — optimize the code.
- Accept — document the tradeoff (e.g., correctness over speed).
- Defer — open an issue for future optimization.

Document an accepted tradeoff in the source:

// Intentional tradeoff: Added validation reduces performance by ~10%
// See issue #123 for optimization ideas
fn parse(input: &str) -> Result<Output> {
    validate(input)?;  // New validation (slower but correct)
    // ...
}

Verify: re-run cargo bench -- --baseline main and confirm the change is acknowledged or resolved.

Run advanced benchmarks

# Compare multiple baselines
cargo bench -- --save-baseline main
cargo bench -- --save-baseline before-refactor
cargo bench -- --baseline before-refactor

# Set the significance level explicitly
cargo bench -- --baseline main --significance-level 0.05

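Report throughput alongside time per iteration. Throughput is set on a benchmark group, not on the Bencher:

let data = vec![0u8; 1_000_000];
let mut group = c.benchmark_group("process_bytes");
// Tell Criterion how many bytes one iteration processes
group.throughput(Throughput::Bytes(data.len() as u64));
group.bench_function("process_data", |b| {
    b.iter(|| process_data(black_box(&data)))
});
group.finish();

Verify: the output includes a thrpt: line next to time: (for Throughput::Bytes, in units such as MiB/s).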
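The throughput figure is the number of bytes processed per iteration divided by the time per iteration. Criterion computes it for you; the sketch below only shows the arithmetic so a reported value can be sanity-checked. The function name mib_per_second is hypothetical, and binary units (1 MiB = 1,048,576 bytes) are assumed.

```rust
use std::time::Duration;

const BYTES_PER_MIB: f64 = 1024.0 * 1024.0;

/// Converts bytes processed per iteration and time per iteration
/// into MiB/s. A zero duration yields infinity.
pub fn mib_per_second(bytes_per_iter: u64, time_per_iter: Duration) -> f64 {
    bytes_per_iter as f64 / BYTES_PER_MIB / time_per_iter.as_secs_f64()
}
```

For the 1,000,000-byte buffer used above, an iteration time of 1 ms works out to roughly 953.7 MiB/s.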

Troubleshooting

Noisy results (change: [-15% +2% +18%] (p = 0.52)) — caused by CPU frequency scaling, background processes, or thermal throttling:

# Increase sample size
cargo bench -- --sample-size 1000

# Disable CPU frequency scaling (Linux)
sudo cpupower frequency-set --governor performance

Missing baseline (Warning: No baseline found for benchmark) — run once on the main branch to establish a baseline.

Slow benchmarks: shorten the measurement time (in seconds), at the cost of noisier results:

cargo bench -- --measurement-time 1

Benchmark best practices

Use black_box to stop the compiler optimizing the work away:

// ❌ Bad - the compiler may constant-fold or remove the call
b.iter(|| expensive_function(100));

// ✅ Good - Prevents optimization
b.iter(|| expensive_function(black_box(100)));

Keep setup out of measurement:

// ❌ Bad - Includes allocation in measurement
b.iter(|| {
    let data = vec![1, 2, 3];  // Measured
    process(&data)
});

// ✅ Good - Setup outside measurement
let data = vec![1, 2, 3];
b.iter(|| process(black_box(&data)));
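To see why placement matters, it helps to picture what a timing loop does: everything inside the closure runs between the start and stop of the clock, and everything outside does not. The hand-rolled loop below is a simplified illustration using only the standard library. It is not how Criterion is implemented, and mean_time_per_call is a hypothetical name.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Calls `routine` exactly `iters` times and returns the mean time per call.
/// Only work done inside `routine` is included in the measurement.
pub fn mean_time_per_call<T>(iters: u32, mut routine: impl FnMut() -> T) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the result "used" so the call is not optimized out.
        black_box(routine());
    }
    start.elapsed() / iters
}
```

Building the input before calling mean_time_per_call and borrowing it inside the closure keeps the allocation outside the timed loop, which mirrors the "Good" pattern above.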

Benchmark representative sizes:

c.bench_function("small input", |b| {
    b.iter(|| parse(black_box("short")))
});

c.bench_function("large input", |b| {
    let large = "x".repeat(10_000);
    b.iter(|| parse(black_box(&large)))
});

Parameterize over inputs:

use criterion::{BenchmarkId, Criterion};

fn bench_sizes(c: &mut Criterion) {
    let mut group = c.benchmark_group("parse_sizes");

    for size in [10, 100, 1000, 10000] {
        group.bench_with_input(
            BenchmarkId::from_parameter(size),
            &size,
            |b, &size| {
                let input = "x".repeat(size);
                b.iter(|| parse(black_box(&input)))
            }
        );
    }

    group.finish();
}

Benchmark hot paths — focus on critical performance code.
Use realistic inputs — production-like data.
Isolate variables — one change at a time.
Accept some variation — ±5% is often noise.
Profile before optimizing — use flamegraph/perf.
Document tradeoffs — sometimes slower is better (safety, correctness).

Why this matters

Performance is a property that erodes silently: a single PR rarely makes the code dramatically slower, but a year of unmeasured changes can. Running benchmarks on every PR and uploading the results turns that slow drift into a reviewable signal you can inspect per change. The statistical significance test (p < 0.05) and the noise threshold exist because microbenchmarks are jittery — without them, every run would look like a regression and the signal would be ignored. Pairing the threshold with profiling (flamegraph/perf) means a flagged regression leads to a root cause, not just a red mark. The template does not yet gate merges on a regression automatically; comparison against a baseline is done by reviewing the artifact or running cargo bench -- --baseline … locally.