Benchmark Regression

Automated performance benchmarking with regression detection using Criterion.rs, so performance regressions are caught before they merge.

| Field | Value |
|-------|-------|
| Workflow | .github/workflows/benchmark-regression.yml |
| Tool | Criterion.rs |
| Triggers | Pull requests + manual (workflow_dispatch) |
| Goal | Prevent performance regressions |

The workflow runs these stages automatically:

  1. Restore — restore a cached baseline (target/criterion) from a prior run, if one exists.
  2. Run — run the benchmarks under benches/ (if present) on the PR branch.
  3. Report — generate benchmark-report.md from the run output.
  4. Upload — upload the benchmark-results artifact (report, target/criterion/, raw output; 90-day retention).
expensive_function time: [48.123 µs 48.567 µs 49.012 µs]
change: [-5.2341% -3.1234% -1.0123%] (p = 0.02 < 0.05)
Performance has improved.
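  • time: confidence interval for the time per iteration (lower bound, best estimate, upper bound).
  • change: confidence interval for the percent change from the baseline (lower bound, best estimate, upper bound).
  • p-value: statistical significance (p < 0.05 = significant).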
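
When reviewing many benchmarks, it can help to pull the three change figures and the p-value out of the raw output programmatically. The following is a minimal sketch using only the Rust standard library. `ChangeLine` and `parse_change_line` are hypothetical names, not part of Criterion, and the sketch assumes the plain-text line shape shown above with any terminal color codes already stripped.

```rust
/// The three figures and the p-value from one Criterion `change:` line.
/// (Hypothetical helper type, not part of Criterion.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ChangeLine {
    pub lower_pct: f64,
    pub estimate_pct: f64,
    pub upper_pct: f64,
    pub p_value: f64,
}

/// Parses a line such as
/// `change: [-5.2341% -3.1234% -1.0123%] (p = 0.02 < 0.05)`.
/// Returns `None` for any line that does not match that shape.
pub fn parse_change_line(line: &str) -> Option<ChangeLine> {
    let rest = line.trim().strip_prefix("change:")?;
    let open = rest.find('[')?;
    // Search for the closing bracket only after the opening one.
    let close = open + rest[open..].find(']')?;

    // Inside the brackets: three whitespace-separated percentages.
    let mut figures = rest[open + 1..close]
        .split_whitespace()
        .map(|token| token.trim_end_matches('%').parse::<f64>().ok());
    let lower_pct = figures.next()??;
    let estimate_pct = figures.next()??;
    let upper_pct = figures.next()??;

    // After the brackets: "(p = 0.02 < 0.05)"; take the token after "p =".
    let after_p = rest[close..].split("p =").nth(1)?;
    let p_value = after_p.split_whitespace().next()?.parse::<f64>().ok()?;

    Some(ChangeLine { lower_pct, estimate_pct, upper_pct, p_value })
}
```

Lines that are not `change:` lines (for example the `time:` line) yield `None`, so the function can be mapped over every line of the output and the misses discarded.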
| Change | Interpretation | Action |
|--------|----------------|--------|
| < -5% | Improvement | Great; document what improved |
| -5% to +5% | No change | Within noise threshold |
| +5% to +20% | Minor regression ⚠️ | Investigate if acceptable |
| > +20% | Major regression | Must fix before merge |
change: [+2.5% +5.2% +7.8%] (p = 0.45 > 0.05)
No change in performance detected.

Not significant — variation likely due to noise, not a real change.

change: [+15.2% +18.5% +21.3%] (p = 0.001 < 0.05)
Performance has regressed.

Significant — real performance degradation detected.
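
The interpretation bands listed above can be written down as a small function, which is handy when scripting over many results. This is a sketch only: `ChangeClass` and `classify_change` are hypothetical names, and the handling of the exact boundaries (-5, +5, +20) is an assumption, since the bands do not say which side owns each edge.

```rust
/// Review category for a benchmark's percent change (positive = slower).
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChangeClass {
    Improvement,
    NoChange,
    MinorRegression,
    MajorRegression,
}

/// Maps a percent change onto the four bands.
/// Assumption: exactly -5% and +5% count as "no change",
/// and exactly +20% counts as a minor regression.
pub fn classify_change(percent: f64) -> ChangeClass {
    if percent < -5.0 {
        ChangeClass::Improvement
    } else if percent <= 5.0 {
        ChangeClass::NoChange
    } else if percent <= 20.0 {
        ChangeClass::MinorRegression
    } else {
        ChangeClass::MajorRegression
    }
}
```

For example, `classify_change(14.7)` falls in the minor-regression band, while `classify_change(0.9)` is treated as no change.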

Not automated. The workflow runs the benchmarks and uploads the results, but it does not gate the PR on a regression — its compare step always reports no regression. Use these criteria when reviewing the uploaded artifact (or comparing locally with cargo bench -- --baseline …) to decide whether a change is a real regression:

  1. Statistical significance (p < 0.05).
  2. Magnitude (> 5% slower).
  3. Consistency (median in the regression range).
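
Because the decision is manual, it may help to see the three criteria combined into one predicate. The sketch below is one possible reading, not what the workflow does: `ChangeEstimate` and `is_real_regression` are hypothetical names, "magnitude" is taken to mean the middle figure of the `change:` line exceeds 5%, and "consistency" is read (as an assumption) as the whole interval lying on the slower side.

```rust
/// Figures copied from one `change:` line, in percent (positive = slower).
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy)]
pub struct ChangeEstimate {
    pub lower_pct: f64,
    pub middle_pct: f64,
    pub upper_pct: f64,
    pub p_value: f64,
}

/// Applies the three review criteria to a single benchmark.
pub fn is_real_regression(change: &ChangeEstimate) -> bool {
    // 1. Statistical significance.
    let significant = change.p_value < 0.05;
    // 2. Magnitude: the middle estimate is more than 5% slower.
    let large_enough = change.middle_pct > 5.0;
    // 3. Consistency (assumed reading): even the lower bound is slower.
    let consistent = change.lower_pct > 0.0;
    significant && large_enough && consistent
}
```

With the two sample outputs shown earlier, `[+2.5% +5.2% +7.8%] (p = 0.45)` is rejected on significance, while `[+15.2% +18.5% +21.3%] (p = 0.001)` passes all three checks.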

The workflow writes a benchmark-report.md into the benchmark-results artifact (it does not post a PR comment). The report wraps the tail of the raw cargo bench output; for example:

# Benchmark Results
## Performance Summary
| Benchmark | Baseline | Current | Change |
|-----------|----------|---------|--------|
| parse_small | 1.23 µs | 1.20 µs | -2.4% ✅ |
| parse_large | 45.6 µs | 52.3 µs | +14.7% ⚠️ |
| compute | 234 ns | 236 ns | +0.9% ⚪ |
## Regressions Detected ⚠️
**parse_large**: 14.7% slower (p < 0.01)
- Review recent changes to parsing logic
- Consider optimization or accept tradeoff
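
How the workflow assembles the report is not shown here. As a rough illustration of the "wrap the tail of the raw output" idea only, a report builder could look like the sketch below. `build_report`, the `max_lines` parameter and the four-space indentation are all assumptions made for this example, not the workflow's actual implementation.

```rust
/// Builds a Markdown report from the last `max_lines` lines of raw
/// `cargo bench` output. (Hypothetical helper for illustration.)
pub fn build_report(raw_output: &str, max_lines: usize) -> String {
    let lines: Vec<&str> = raw_output.lines().collect();
    // Keep only the tail; saturating_sub avoids underflow on short output.
    let start = lines.len().saturating_sub(max_lines);

    let mut report = String::from("# Benchmark Results\n\n");
    for line in &lines[start..] {
        // Four leading spaces render as a code block in Markdown.
        report.push_str("    ");
        report.push_str(line);
        report.push('\n');
    }
    report
}
```

Writing the returned string to benchmark-report.md (for instance with `std::fs::write`) would then produce a file suitable for uploading with the other artifacts.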

Criterion generates HTML reports under target/criterion/:

target/criterion/
├── expensive_function/
│   ├── report/
│   │   ├── index.html
│   │   └── violin.svg
│   └── base/
└── report/
    └── index.html

Open target/criterion/report/index.html to view them.

  1. Create a benches/ directory at the crate root:

    benches/
    ├── my_benchmark.rs
    └── another_benchmark.rs
  2. Write a benchmark file, benches/performance.rs:

    use criterion::{black_box, criterion_group, criterion_main, Criterion};
    // `process` is assumed to be exported by the crate alongside `expensive_function`.
    use rust_template::{expensive_function, process};

    fn benchmark_expensive_function(c: &mut Criterion) {
        c.bench_function("expensive_function", |b| {
            b.iter(|| expensive_function(black_box(100)))
        });
    }

    fn benchmark_with_setup(c: &mut Criterion) {
        c.bench_function("with_setup", |b| {
            // Build the input once, outside the timed closure.
            let data = vec![1, 2, 3, 4, 5];
            b.iter(|| process(black_box(&data)))
        });
    }

    criterion_group!(benches, benchmark_expensive_function, benchmark_with_setup);
    criterion_main!(benches);
  3. Register the bench target in Cargo.toml:

    [[bench]]
    name = "performance"
    harness = false
  4. Verify: cargo bench --bench performance.

# Run all benchmarks
cargo bench
# Run a specific benchmark
cargo bench --bench performance
# Save a baseline
cargo bench -- --save-baseline main
# Compare against a baseline
cargo bench -- --baseline main

Verify: confirm the run prints time: and change: lines.

Tune sampling in benches/benchmark.rs:

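use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion, SamplingMode};

fn custom_criterion() -> Criterion {
    Criterion::default()
        .sample_size(100) // More samples tighten the estimate but take longer
        .warm_up_time(Duration::from_secs(3))
        .measurement_time(Duration::from_secs(5))
        .noise_threshold(0.05) // Treat changes under 5% as noise
}

fn my_benchmark(c: &mut Criterion) {
    // Sampling mode is set per benchmark group, not on Criterion itself.
    let mut group = c.benchmark_group("my_group");
    group.sampling_mode(SamplingMode::Flat);
    // group.bench_function(...) calls go here
    group.finish();
}

criterion_group! {
    name = benches;
    config = custom_criterion();
    targets = my_benchmark
}
criterion_main!(benches);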

Adjust the CI run duration in the workflow:

# .github/workflows/benchmark-regression.yml
inputs:
  duration:
    default: '300' # 5 minutes

Verify: run cargo bench and check the warm-up and measurement times reported in the output.

  1. Profile to find the hot path:

    # Option A: flamegraph
    cargo install flamegraph
    cargo flamegraph --bench performance
    # Option B: perf
    perf record --call-graph dwarf cargo bench
    perf report
  2. Choose an outcome:

    • Fix — optimize the code.
    • Accept — document the tradeoff (e.g., correctness over speed).
    • Defer — open an issue for future optimization.
  3. Document an accepted tradeoff in the source:

    // Intentional tradeoff: added validation reduces performance by ~10%.
    // See issue #123 for optimization ideas.
    fn parse(input: &str) -> Result<Output> {
        validate(input)?; // New validation (slower but correct)
        // ...
    }

Verify: re-run cargo bench -- --baseline main and confirm the change is acknowledged or resolved.

# Compare multiple baselines
cargo bench -- --save-baseline main
cargo bench -- --save-baseline before-refactor
cargo bench -- --baseline before-refactor
# Manual significance level
cargo bench -- --baseline main --significance-level 0.05

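Measure throughput in addition to time per iteration. Throughput is configured on a benchmark group:

use criterion::{black_box, Criterion, Throughput};
fn bench_throughput(c: &mut Criterion) {
    let data = vec![0u8; 1_000_000];
    let mut group = c.benchmark_group("throughput");
    // Throughput is set on the group; Bencher has no throughput method.
    group.throughput(Throughput::Bytes(data.len() as u64));
    group.bench_function("process_bytes", |b| b.iter(|| process_data(black_box(&data))));
    group.finish();
}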

Verify: the output includes a thrpt: line reporting bytes per second (for example MiB/s).
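
To sanity-check a reported throughput figure, the arithmetic is simply bytes processed divided by elapsed time. The helper below is a hypothetical, std-only sketch (`mib_per_second` is not a Criterion function) and assumes binary units, where 1 MiB = 1,048,576 bytes.

```rust
use std::time::Duration;

/// Converts a byte count and an elapsed time into MiB per second.
/// (Hypothetical helper; assumes 1 MiB = 1024 * 1024 bytes.)
pub fn mib_per_second(bytes: u64, elapsed: Duration) -> f64 {
    let mebibytes = bytes as f64 / (1024.0 * 1024.0);
    mebibytes / elapsed.as_secs_f64()
}
```

For the 1,000,000-byte input used above, an iteration time of 1 ms works out to roughly 954 MiB/s.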

Noisy results (change: [-15% +2% +18%] (p = 0.52)) — caused by CPU frequency scaling, background processes, or thermal throttling:

# Increase sample size
cargo bench -- --sample-size 1000
# Disable CPU frequency scaling (Linux)
sudo cpupower frequency-set --governor performance

Missing baseline (Warning: No baseline found for benchmark) — run once on the main branch to establish a baseline.

Slow benchmarks:

cargo bench -- --measurement-time 1
  1. Use black_box to stop the compiler optimizing the work away:

    // ❌ Bad - the compiler may optimize the call away
    b.iter(|| expensive_function(100));
    // ✅ Good - Prevents optimization
    b.iter(|| expensive_function(black_box(100)));
  2. Keep setup out of measurement:

    // ❌ Bad - Includes allocation in measurement
    b.iter(|| {
        let data = vec![1, 2, 3]; // Measured
        process(&data)
    });
    // ✅ Good - Setup outside measurement
    let data = vec![1, 2, 3];
    b.iter(|| process(black_box(&data)));
  3. Benchmark representative sizes:

    c.bench_function("small input", |b| {
        b.iter(|| parse(black_box("short")))
    });
    c.bench_function("large input", |b| {
        let large = "x".repeat(10_000);
        b.iter(|| parse(black_box(&large)))
    });
  4. Parameterize over inputs:

    use criterion::{black_box, BenchmarkId, Criterion};

    fn bench_sizes(c: &mut Criterion) {
        let mut group = c.benchmark_group("parse_sizes");
        for size in [10, 100, 1000, 10000] {
            group.bench_with_input(
                BenchmarkId::from_parameter(size),
                &size,
                |b, &size| {
                    // Build the input outside the timed closure.
                    let input = "x".repeat(size);
                    b.iter(|| parse(black_box(&input)))
                },
            );
        }
        group.finish();
    }
  5. Benchmark hot paths — focus on critical performance code.

  6. Use realistic inputs — production-like data.

  7. Isolate variables — one change at a time.

  8. Accept some variation — ±5% is often noise.

  9. Profile before optimizing — use flamegraph/perf.

  10. Document tradeoffs — sometimes slower is better (safety, correctness).

Performance is a property that erodes silently: a single PR rarely makes the code dramatically slower, but a year of unmeasured changes can. Running benchmarks on every PR and uploading the results turns that slow drift into a reviewable signal you can inspect per change. The statistical significance test (p < 0.05) and the noise threshold exist because microbenchmarks are jittery — without them, every run would look like a regression and the signal would be ignored. Pairing the threshold with profiling (flamegraph/perf) means a flagged regression leads to a root cause, not just a red mark. The template does not yet gate merges on a regression automatically; comparison against a baseline is done by reviewing the artifact or running cargo bench -- --baseline … locally.