Benchmark Regression
Automated performance benchmarking with regression detection using Criterion.rs, so performance regressions are caught before they merge.
Reference
| Field | Value |
|---|---|
| Workflow | .github/workflows/benchmark-regression.yml |
| Tool | Criterion.rs |
| Triggers | Pull requests + manual (workflow_dispatch) |
| Goal | Prevent performance regressions |
CI pipeline stages
The workflow runs these stages automatically:
- Restore: restore a cached baseline (`target/criterion`) from a prior run, if one exists.
- Run: run the benchmarks under `benches/` (if present) on the PR branch.
- Report: generate `benchmark-report.md` from the run output.
- Upload: upload the `benchmark-results` artifact (report, `target/criterion/`, raw output; 90-day retention).
Benchmark output
```
expensive_function      time:   [48.123 µs 48.567 µs 49.012 µs]
                        change: [-5.2341% -3.1234% -1.0123%] (p = 0.02 < 0.05)
                        Performance has improved.
```
- time: estimated time per iteration, shown as a confidence interval (lower bound, best estimate, upper bound), not as min/median/max.
- change: percent change from the baseline, shown as the same kind of three-value interval.
- p-value: statistical significance (`< 0.05` = significant).
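When reviewing raw output from the artifact, it can help to pull the numbers out of a `change:` line programmatically. The following is a minimal sketch using only the standard library; `ChangeLine` and `parse_change_line` are hypothetical names, and the parser assumes the line has the shape shown in the sample output above.

```rust
/// Parsed form of a Criterion `change:` line (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct ChangeLine {
    lower_pct: f64,
    estimate_pct: f64,
    upper_pct: f64,
    p_value: f64,
}

/// Parses e.g. `change: [-5.2341% -3.1234% -1.0123%] (p = 0.02 < 0.05)`.
/// Returns `None` if the line does not have that shape.
fn parse_change_line(line: &str) -> Option<ChangeLine> {
    let rest = line.trim().strip_prefix("change:")?;
    let open = rest.find('[')?;
    let close = rest.find(']')?;
    if close <= open {
        return None;
    }
    // The three bracketed values are percentages separated by whitespace.
    let mut values = rest[open + 1..close]
        .split_whitespace()
        .map(|v| v.trim_end_matches('%').parse::<f64>().ok());
    let lower_pct = values.next()??;
    let estimate_pct = values.next()??;
    let upper_pct = values.next()??;
    // The p-value follows "p = " inside the trailing parentheses.
    let after_p = rest[close..].split("p = ").nth(1)?;
    let p_value = after_p
        .split_whitespace()
        .next()?
        .trim_end_matches(')')
        .parse::<f64>()
        .ok()?;
    Some(ChangeLine { lower_pct, estimate_pct, upper_pct, p_value })
}
```

Lines that are not `change:` lines (for example the `time:` line) yield `None`, so the function can be mapped over every line of a saved log.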
Performance change thresholds
| Change | Interpretation | Action |
|---|---|---|
| < -5% | Improvement ✅ | Document what improved |
| -5% to +5% | No change ⚪ | Within noise threshold |
| +5% to +20% | Minor regression ⚠️ | Investigate if acceptable |
| > +20% | Major regression ❌ | Must fix before merge |
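The bands in the table can be written down as a small classification function. This is a sketch, not part of the workflow: `ChangeClass` and `classify_change` are hypothetical names, and the handling of values that fall exactly on a boundary (+5% counts as no change, +20% as minor) is an assumption, since the table does not say which band owns the edges.

```rust
/// Interpretation bands from the thresholds table (hypothetical type).
#[derive(Debug, PartialEq)]
enum ChangeClass {
    Improvement,
    NoChange,
    MinorRegression,
    MajorRegression,
}

/// Maps a percent change (positive = slower) to its band.
/// Boundary values are assigned to the milder band by assumption.
fn classify_change(change_pct: f64) -> ChangeClass {
    if change_pct < -5.0 {
        ChangeClass::Improvement
    } else if change_pct <= 5.0 {
        ChangeClass::NoChange
    } else if change_pct <= 20.0 {
        ChangeClass::MinorRegression
    } else {
        ChangeClass::MajorRegression
    }
}
```

For instance, a change of +14.7% lands in the minor-regression band, while -3.1% stays within the noise band.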
Statistical significance
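```
change: [+2.5% +5.2% +7.8%] (p = 0.45 > 0.05)
No change in performance detected.
```
Not significant: the variation is likely due to noise, not a real change. (Criterion prints "Change within noise threshold." in a different case: when the change is statistically significant but smaller than the configured noise threshold.)

```
change: [+15.2% +18.5% +21.3%] (p = 0.001 < 0.05)
Performance has regressed.
```
Significant: a real performance degradation was detected.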
Regression detection criteria
Not automated. The workflow runs the benchmarks and uploads the results, but it does not gate the PR on a regression: its compare step always reports no regression. Use these criteria when reviewing the uploaded artifact (or comparing locally with `cargo bench -- --baseline …`) to decide whether a change is a real regression:
- Statistical significance (`p < 0.05`).
- Magnitude (`> 5%` slower).
- Consistency (the middle value of the `change:` interval, not only its upper bound, falls in the regression range).
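Because the workflow does not apply these criteria itself, a reviewer has to apply them by hand. The sketch below shows one reading of them in code; `Verdict` and `review_change` are hypothetical names, and it treats magnitude and consistency as a single check on the middle value of the `change:` interval, which is an interpretation rather than something the workflow defines.

```rust
/// Outcome of applying the review criteria (hypothetical type).
#[derive(Debug, PartialEq)]
enum Verdict {
    /// Significant and more than 5% slower.
    Regression,
    /// p-value at or above 0.05: likely noise.
    NotSignificant,
    /// Significant, but the middle estimate is not more than 5% slower.
    BelowMagnitude,
}

/// Applies the criteria to the middle value of the `change:` interval
/// (`estimate_pct`, positive = slower) and its p-value.
fn review_change(estimate_pct: f64, p_value: f64) -> Verdict {
    const SIGNIFICANCE_LEVEL: f64 = 0.05;
    const MAGNITUDE_PCT: f64 = 5.0;

    if p_value >= SIGNIFICANCE_LEVEL {
        return Verdict::NotSignificant;
    }
    if estimate_pct <= MAGNITUDE_PCT {
        return Verdict::BelowMagnitude;
    }
    Verdict::Regression
}
```

Applied to the two samples under "Statistical significance", the first (+5.2%, p = 0.45) is rejected as not significant even though its estimate exceeds 5%, and the second (+18.5%, p = 0.001) is reported as a regression.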
Benchmark report artifact
The workflow writes a `benchmark-report.md` into the `benchmark-results` artifact (it does not post a PR comment). The report wraps the tail of the raw `cargo bench` output; for example:

```markdown
# Benchmark Results

## Performance Summary

| Benchmark | Baseline | Current | Change |
|-----------|----------|---------|--------|
| parse_small | 1.23 µs | 1.20 µs | -2.4% ⚪ |
| parse_large | 45.6 µs | 52.3 µs | +14.7% ⚠️ |
| compute | 234 ns | 236 ns | +0.9% ⚪ |

## Regressions Detected ⚠️

**parse_large**: 14.7% slower (p < 0.01)
- Review recent changes to parsing logic
- Consider optimization or accept tradeoff
```
Criterion HTML reports
Criterion generates HTML reports under `target/criterion/`:

```
target/criterion/
├── expensive_function/
│   ├── report/
│   │   ├── index.html
│   │   └── violin.svg
│   └── base/
└── report/
    └── index.html
```
Open `target/criterion/report/index.html` to view them.
How-to
Set up benchmarks
- Create a `benches/` directory at the crate root:

  ```
  benches/
  ├── my_benchmark.rs
  └── another_benchmark.rs
  ```
- Write a benchmark file, `benches/performance.rs`:

  ```rust
  use criterion::{black_box, criterion_group, criterion_main, Criterion};
  // `process` is assumed to be exported by the crate alongside `expensive_function`.
  use rust_template::{expensive_function, process};

  fn benchmark_expensive_function(c: &mut Criterion) {
      c.bench_function("expensive_function", |b| {
          // black_box keeps the compiler from constant-folding the argument.
          b.iter(|| expensive_function(black_box(100)))
      });
  }

  fn benchmark_with_setup(c: &mut Criterion) {
      c.bench_function("with_setup", |b| {
          // Setup happens outside the timed `iter` closure.
          let data = vec![1, 2, 3, 4, 5];
          b.iter(|| process(black_box(&data)))
      });
  }

  criterion_group!(benches, benchmark_expensive_function, benchmark_with_setup);
  criterion_main!(benches);
  ```
- Register the bench target in `Cargo.toml` (`criterion` must also be listed under `[dev-dependencies]`):

  ```toml
  [[bench]]
  name = "performance"
  harness = false
  ```
- Verify: `cargo bench --bench performance`.
Run benchmarks locally
```bash
# Run all benchmarks
cargo bench

# Run a specific benchmark
cargo bench --bench performance

# Save a baseline
cargo bench -- --save-baseline main

# Compare against a baseline
cargo bench -- --baseline main
```
Verify: confirm the run prints `time:` and `change:` lines.
Configure Criterion
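Tune sampling in `benches/benchmark.rs`:

```rust
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion, SamplingMode};

fn custom_criterion() -> Criterion {
    Criterion::default()
        .sample_size(100) // More samples = tighter estimates, longer runs
        .warm_up_time(Duration::from_secs(3))
        .measurement_time(Duration::from_secs(5))
        .noise_threshold(0.05) // 5% noise threshold
}

fn my_benchmark(c: &mut Criterion) {
    // The sampling mode is set per benchmark group, not on `Criterion` itself.
    let mut group = c.benchmark_group("my_group");
    group.sampling_mode(SamplingMode::Flat);
    // group.bench_function(...) calls go here.
    group.finish();
}

criterion_group! {
    name = benches;
    config = custom_criterion();
    targets = my_benchmark
}
criterion_main!(benches);
```
Adjust the CI run duration in the workflow:

```yaml
inputs:
  duration:
    default: '300' # 5 minutes
```
Verify: run `cargo bench` and check the warm-up/measurement times in the output.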
Investigate a regression
- Profile to find the hot path:

  ```bash
  cargo install flamegraph
  cargo flamegraph --bench performance
  ```

  ```bash
  perf record --call-graph dwarf cargo bench
  perf report
  ```
- Choose an outcome:
  - Fix: optimize the code.
  - Accept: document the tradeoff (e.g., correctness over speed).
  - Defer: open an issue for future optimization.
- Document an accepted tradeoff in the source:

  ```rust
  // Intentional tradeoff: Added validation reduces performance by ~10%
  // See issue #123 for optimization ideas
  fn parse(input: &str) -> Result<Output> {
      validate(input)?; // New validation (slower but correct)
      // ...
  }
  ```

Verify: re-run `cargo bench -- --baseline main` and confirm the change is acknowledged or resolved.
Run advanced benchmarks
```bash
# Compare multiple baselines
cargo bench -- --save-baseline main
cargo bench -- --save-baseline before-refactor
cargo bench -- --baseline before-refactor

# Manual significance level
cargo bench -- --baseline main --significance-level 0.05
```
Measure throughput instead of time/iteration:

```rust
let data = vec![0u8; 1_000_000];
// Throughput is set on a benchmark group; `Bencher` has no `throughput` method.
let mut group = c.benchmark_group("process_bytes");
group.throughput(Throughput::Bytes(data.len() as u64));
group.bench_function("process_data", |b| {
    b.iter(|| process_data(black_box(&data)))
});
group.finish();
```
Verify: the output includes a `thrpt:` line reporting bytes per second (for example, MiB/s).
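A throughput figure is the configured byte count divided by the measured time per iteration. The arithmetic can be sketched with the standard library alone; `bytes_per_second` and `mib_per_second` are hypothetical helper names, and the conversion uses binary mebibytes (decimal MB/s would divide by 1,000,000 instead).

```rust
use std::time::Duration;

/// Bytes per second for `bytes` processed in `elapsed`.
fn bytes_per_second(bytes: u64, elapsed: Duration) -> f64 {
    bytes as f64 / elapsed.as_secs_f64()
}

/// Converts bytes per second to MiB/s (1 MiB = 1024 * 1024 bytes).
fn mib_per_second(bytes_per_sec: f64) -> f64 {
    bytes_per_sec / (1024.0 * 1024.0)
}
```

For the 1,000,000-byte buffer above, a hypothetical iteration time of 250 µs works out to about 4.0e9 bytes per second, or roughly 3,815 MiB/s.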
Troubleshooting
Noisy results (`change: [-15% +2% +18%] (p = 0.52)`): caused by CPU frequency scaling, background processes, or thermal throttling.

```bash
# Increase sample size
cargo bench -- --sample-size 1000

# Disable CPU frequency scaling (Linux)
sudo cpupower frequency-set --governor performance
```
Missing baseline (`Warning: No baseline found for benchmark`): run once on the main branch to establish a baseline.

Slow benchmarks: shorten the measurement time.

```bash
cargo bench -- --measurement-time 1
```
Benchmark best practices
- Use `black_box` to stop the compiler optimizing the work away:

  ```rust
  // ❌ Bad - Compiler optimizes away
  b.iter(|| expensive_function(100));

  // ✅ Good - Prevents optimization
  b.iter(|| expensive_function(black_box(100)));
  ```
- Keep setup out of measurement:

  ```rust
  // ❌ Bad - Includes allocation in measurement
  b.iter(|| {
      let data = vec![1, 2, 3]; // Measured
      process(&data)
  });

  // ✅ Good - Setup outside measurement
  let data = vec![1, 2, 3];
  b.iter(|| process(black_box(&data)));
  ```
- Benchmark representative sizes:

  ```rust
  c.bench_function("small input", |b| {
      b.iter(|| parse(black_box("short")))
  });
  c.bench_function("large input", |b| {
      let large = "x".repeat(10_000);
      b.iter(|| parse(black_box(&large)))
  });
  ```
- Parameterize over inputs:

  ```rust
  use criterion::{black_box, BenchmarkId, Criterion};

  fn bench_sizes(c: &mut Criterion) {
      let mut group = c.benchmark_group("parse_sizes");
      for size in [10, 100, 1000, 10000] {
          group.bench_with_input(
              BenchmarkId::from_parameter(size),
              &size,
              |b, &size| {
                  let input = "x".repeat(size);
                  b.iter(|| parse(black_box(&input)))
              },
          );
      }
      group.finish();
  }
  ```
- Benchmark hot paths: focus on critical performance code.
- Use realistic inputs: production-like data.
- Isolate variables: one change at a time.
- Accept some variation: ±5% is often noise.
- Profile before optimizing: use flamegraph/perf.
- Document tradeoffs: sometimes slower is better (safety, correctness).
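The first two practices can be seen in isolation with a hand-rolled timing loop. This is only an illustration using the standard library's `std::hint::black_box` and `Instant`, not a substitute for Criterion's statistics; `mean_time_per_iter` and `sum_squares` are hypothetical names, and `iters` must be non-zero.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Times `iters` calls of `routine` and returns the mean time per call.
/// Panics if `iters` is zero (division by zero).
fn mean_time_per_iter<T>(iters: u32, mut routine: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box marks the result as used so the call is not optimized away.
        black_box(routine());
    }
    start.elapsed() / iters
}

/// Hypothetical workload standing in for the code under test.
fn sum_squares(data: &[u64]) -> u64 {
    data.iter().map(|x| x * x).sum()
}
```

A caller would build the input once, outside the timed loop, and pass it through `black_box` inside the closure, for example `mean_time_per_iter(1_000, || sum_squares(black_box(data.as_slice())))`, so that neither the allocation nor a constant-folded result distorts the measurement.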
Why this matters
Performance is a property that erodes silently: a single PR rarely makes the code dramatically slower, but a year of unmeasured changes can. Running benchmarks on every PR and uploading the results turns that slow drift into a reviewable signal you can inspect per change. The statistical significance test (p < 0.05) and the noise threshold exist because microbenchmarks are jittery: without them, every run would look like a regression and the signal would be ignored. Pairing the threshold with profiling (flamegraph/perf) means a flagged regression leads to a root cause, not just a red mark. The template does not yet gate merges on a regression automatically; comparison against a baseline is done by reviewing the artifact or running cargo bench -- --baseline … locally.