Statistics & Output
as-bench doesn't just time a loop and divide. Each bench goes through the same Criterion-style pipeline — adaptive warmup, a sampling plan, then a bootstrap analysis — so the numbers come with confidence intervals and an honest read on noise.
The measurement pipeline
- Warmup. The routine runs in doubling batches until the per-iteration mean (met) stabilizes within warmupTolerance, or warmupTime is reached. Warmup lets the JIT/runtime settle and produces the met estimate that sizes everything downstream.
- Sampling plan. From met the engine picks how many samples to collect and how many iterations each sample runs, aiming to fill measurementTime. With sampleSize: 0 (the default) the sample count auto-sizes so each sample represents ~10 ms of work, clamped to [10, 500].
- Timed samples. Each sample times a batch of iterations; the per-iteration time is the sample's value.
- Bootstrap analysis. The sample set is resampled numResamples times to build confidence intervals for the point estimate and to classify outliers.
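The auto-sizing rule in the sampling plan step can be sketched as follows. This is a minimal illustration, assuming the sample count is simply measurementTime divided by the ~10 ms target and that a flat plan splits the window evenly; the helper names autoSampleCount and itersPerSample are hypothetical and are not taken from as-bench.

```typescript
// Hypothetical helpers illustrating the auto-sizing rule; not as-bench's actual code.

const TARGET_SAMPLE_MS = 10; // each sample should represent ~10 ms of work
const MIN_SAMPLES = 10;
const MAX_SAMPLES = 500;

/** Sample count for sampleSize: 0, clamped to [10, 500]. */
function autoSampleCount(measurementTimeMs: number): number {
  const raw = Math.round(measurementTimeMs / TARGET_SAMPLE_MS);
  return Math.min(MAX_SAMPLES, Math.max(MIN_SAMPLES, raw));
}

/** Iterations per sample (assuming a flat plan) so the samples together fill the window. */
function itersPerSample(measurementTimeMs: number, samples: number, metMs: number): number {
  const perSampleMs = measurementTimeMs / samples;
  return Math.max(1, Math.round(perSampleMs / metMs));
}
```

Under these assumptions a 1000 ms window gives 100 samples, a 50 ms window is clamped up to 10, and a 10 s window is clamped down to 500.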
Warmup tuning
| Setting | Effect |
|---|---|
| warmupTime | Upper bound on warmup (ms). |
| warmupMinTime | Earliest point at which warmup may be judged stable. |
| warmupTolerance | Relative met drift treated as "stable". 0 disables early exit: warmup always runs the full warmupTime. |
Adaptive warmup (warmupTolerance > 0) exits as soon as two consecutive batches agree to within the tolerance and warmupMinTime has passed, so fast, stable benches don't waste the full window. Set warmupTolerance: 0 for a fixed-time warmup when you want every run to warm up for the same duration.
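As a rough sketch of how such a loop could behave: the option names below mirror the settings in the table, but the loop itself, the runBatch callback, and the function name adaptiveWarmup are assumptions made for illustration, not as-bench internals.

```typescript
// Sketch only: option names mirror the settings above; the loop is an assumption.

interface WarmupOptions {
  warmupTime: number;      // upper bound on warmup, in ms
  warmupMinTime: number;   // earliest point at which stability may be judged
  warmupTolerance: number; // relative drift treated as stable; 0 disables early exit
}

/**
 * `runBatch(iters)` is a hypothetical callback that runs the routine `iters`
 * times and returns the elapsed time in ms.
 * Returns the per-iteration mean (met) of the last batch.
 */
function adaptiveWarmup(runBatch: (iters: number) => number, opts: WarmupOptions): number {
  let iters = 1;
  let elapsed = 0;
  let prevMet = Number.NaN;
  let met = Number.NaN;

  while (elapsed < opts.warmupTime) {
    const batchMs = runBatch(iters);
    elapsed += batchMs;
    met = batchMs / iters;

    // Relative drift against the previous batch; NaN on the first batch,
    // which makes the comparison below false.
    const drift = Math.abs(met - prevMet) / prevMet;
    const stable = opts.warmupTolerance > 0 && drift <= opts.warmupTolerance;
    if (stable && elapsed >= opts.warmupMinTime) break;

    prevMet = met;
    iters *= 2; // doubling batches
  }
  return met;
}
```

Passing the timing function in as a callback keeps the loop independent of any particular clock, which also makes it easy to exercise with a fake routine.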
Sampling modes
| Mode | Behavior |
|---|---|
| Auto | The engine chooses linear or flat based on met and the target window. |
| Linear | Each sample runs an increasing number of iterations; the estimate comes from a regression slope through the origin. |
| Flat | Every sample runs the same iteration count. |
If a configuration can't fit the requested samples into measurementTime, the run prints a warning recommending a larger --measure or a smaller --samples.
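For Linear mode, the "regression slope through the origin" is the least-squares slope of total sample time against iteration count with no intercept term. A small sketch of that formula follows; the function name slopeThroughOrigin, the step variable, and the sample data are illustrative and not part of as-bench.

```typescript
/**
 * Least-squares slope for a line forced through the origin:
 *   slope = sum(x * y) / sum(x * x)
 * where x is a sample's iteration count and y is its total elapsed time.
 */
function slopeThroughOrigin(iters: number[], totalMs: number[]): number {
  if (iters.length !== totalMs.length || iters.length === 0) {
    throw new Error("iters and totalMs must be non-empty and the same length");
  }
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < iters.length; i++) {
    sxy += iters[i] * totalMs[i];
    sxx += iters[i] * iters[i];
  }
  return sxy / sxx;
}

// Linear plan: sample k runs k * step iterations (step is a hypothetical name).
const step = 100;
const iterCounts = [1, 2, 3, 4, 5].map((k) => k * step);
const sampleMs = iterCounts.map((n) => n * 0.004); // pretend each iteration costs 4 µs
const perIterMs = slopeThroughOrigin(iterCounts, sampleMs);
```

The slope is the per-iteration time in the same unit as the inputs, so with noise-free data the example recovers the 0.004 ms per iteration that was fed in.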
Reading the output
A standalone bench renders as a card:
fib(20)
───────
time: 46.23 µs [46.17, 46.36]
ops/s: 21,629
samples: 10
- time: the point estimate with its [lower, upper] confidence interval (default 95%).
- ops/s: operations per second (M/G SI prefixes above 1e6).
- samples: how many samples the plan collected.
- thrpt: appears when you pass elementsPerCall to bench().
- outliers: appears only when a bench actually had any.
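The [lower, upper] interval on the time line comes from the bootstrap step. One common way to build such an interval is a percentile bootstrap, sketched below under several assumptions: the statistic is the mean, the RNG is a small seeded mulberry32-style generator, and the names bootstrapCI and mulberry32 are illustrative. as-bench's actual statistic and resampling details may differ.

```typescript
// Percentile bootstrap for the mean. A sketch: the statistic, the RNG, and the
// function names are assumptions, not as-bench internals.

/** Small seeded PRNG (mulberry32-style) so runs are reproducible. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // in [0, 1)
  };
}

function bootstrapCI(
  samples: number[],
  numResamples: number,
  confidence = 0.95,
  rand: () => number = mulberry32(1),
): [number, number] {
  const n = samples.length;
  const means: number[] = [];
  for (let r = 0; r < numResamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += samples[Math.floor(rand() * n)]; // draw with replacement
    }
    means.push(sum / n);
  }
  means.sort((a, b) => a - b);
  // Take the symmetric percentiles of the resampled means.
  const alpha = (1 - confidence) / 2;
  const lower = means[Math.floor(alpha * (numResamples - 1))];
  const upper = means[Math.ceil((1 - alpha) * (numResamples - 1))];
  return [lower, upper];
}
```

A tighter interval means the samples agreed closely; a wide one is the "honest read on noise" that a single timed loop cannot give.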
Suites and deltas
Benches inside a suite() stream into one aligned table. The first bench is the baseline; the rest show a vs baseline multiplier:
fib
───
baseline: fib(15)
benchmark time ops/s vs baseline
───────── ──────────────────── ─────── ───────────
fib(15) 4.15 µs [4.14, 4.15] 241,219 1.00×
fib(20) 45.97 µs [...] 21,753 11.06× slower
The verdict follows Criterion's rule: a change is "no change" when it is not statistically significant or when the entire confidence interval lies inside the noise band. A green × faster / red × slower only appears when the change clears both bars.
Outliers
Samples are classified with Tukey fences (low/high, mild/severe). The outlier section is shown only when something was actually flagged:
outliers:
parse Player 2 / 25
Outliers don't corrupt the estimate, since the bootstrap already accounts for the spread, but a high count is a hint that the bench (or the machine) is noisy.
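Tukey fences place cutoffs at fixed multiples of the interquartile range (IQR) below the first quartile and above the third. The sketch below uses the conventional multipliers, 1.5 × IQR for mild and 3 × IQR for severe, and a linearly interpolated percentile; whether as-bench uses exactly these constants and this quartile method is an assumption, and the names classify and percentile are illustrative.

```typescript
type OutlierClass = "lowSevere" | "lowMild" | "normal" | "highMild" | "highSevere";

/** Percentile with linear interpolation on an already sorted array. */
function percentile(sorted: number[], p: number): number {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

/** Classify each sample against Tukey fences (assumed multipliers: 1.5 and 3). */
function classify(samples: number[]): OutlierClass[] {
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = percentile(sorted, 0.25);
  const q3 = percentile(sorted, 0.75);
  const iqr = q3 - q1;
  return samples.map((x): OutlierClass => {
    if (x < q1 - 3 * iqr) return "lowSevere";
    if (x < q1 - 1.5 * iqr) return "lowMild";
    if (x > q3 + 3 * iqr) return "highSevere";
    if (x > q3 + 1.5 * iqr) return "highMild";
    return "normal";
  });
}
```

Counting the entries that are not "normal" gives the flagged / total pair shown in the outliers section.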
Significance & noise
Two render thresholds control the verdicts; set them in config under render:
| Threshold | Default | Meaning |
|---|---|---|
| significanceLevel | 0.05 | p-value below which a change is "significant". |
| noiseThreshold | 0.01 | Changes whose CI lies within ±this value are reported as "no change". |
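Taken together with the verdict rule described under Suites and deltas, the two thresholds combine roughly as in the sketch below. The Delta shape, its field names, and the use of the point estimate's sign to pick the direction are assumptions for illustration; they are not as-bench's data model.

```typescript
type Verdict = "no change" | "faster" | "slower";

// Hypothetical shape for a comparison result; field names are illustrative.
interface Delta {
  pValue: number;    // from the significance test
  relChange: number; // relative change in time; negative means less time
  ciLower: number;   // lower CI bound of the relative change
  ciUpper: number;   // upper CI bound of the relative change
}

function verdict(d: Delta, significanceLevel = 0.05, noiseThreshold = 0.01): Verdict {
  const significant = d.pValue < significanceLevel;
  // The whole CI must sit inside the +/- noiseThreshold band to count as noise.
  const insideNoise = d.ciLower >= -noiseThreshold && d.ciUpper <= noiseThreshold;
  if (!significant || insideNoise) return "no change";
  return d.relChange < 0 ? "faster" : "slower";
}
```

Raising noiseThreshold makes small but statistically significant shifts report as "no change", which can be useful on noisy shared machines.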
JSON output
asb run --json suppresses the human output and writes one machine-readable document to stdout — point estimates, CIs, deltas with p-values and verdicts, and outlier counts. Times are in milliseconds. See the CLI reference.
Next
- Baselines — turn a run into a comparison point.
- Configuration — every tunable.
