
The baseline methodology

Scenario

A change passes functional tests and feels fine when you click through alone, but the p95 you care about was never written down. Under dozens of concurrent users, the same build might already be slow or noisy; you would not know until a release or an incident. A baseline turns “it felt OK” into numbers you can compare next week.

The baseline workflow

Step | What you do | What it produces
1. Observe | Run without thresholds at realistic load | Actual p95, error rate, throughput
2. Set | Add thresholds based on observed values + headroom | Script with pass/fail criteria
3. Validate | Re-run with thresholds, confirm consistency | Working quality gate

The baseline methodology: observe, set, validate

Why this order matters

  • Observe before you set thresholds. Use measured values. Guesses drift and break trust in the gate.
  • Validate before you wire CI. Stable thresholds first. Flaky gates waste the whole team’s time.
  • Leave headroom on p95. A gap between what you saw and the limit absorbs normal jitter. It still catches real regressions.

The shift

Before baselines | After baselines
“The results looked okay to me” | The test passed with exit code 0
Someone reviews metrics manually | CI/CD decides automatically
Regressions found in production | Regressions blocked at merge

Script

You have test results from Module 2. Numbers came back. But numbers alone don’t tell you whether performance is acceptable. You need a methodology that turns observed values into automated decisions.

That methodology is the baseline workflow, and it has three steps. First, observe. Run your test without thresholds and record what the system actually produces under realistic load. Don’t guess at what the ninety-fifth percentile should be. Measure it.
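As a rough illustration of the observe step, the sketch below summarizes one run into the three numbers the workflow records. It assumes the load tool can export each request as a (latency in milliseconds, success flag) pair; the function names, the sample data, and the nearest-rank percentile method are illustrative assumptions, not something this module prescribes.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples to summarize")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples, duration_s):
    """Summarize one run. samples is a list of (latency_ms, ok) tuples (assumed export format)."""
    latencies = [latency for latency, _ in samples]
    failures = sum(1 for _, ok in samples if not ok)
    return {
        "p95_ms": percentile(latencies, 95),
        "error_rate": failures / len(samples),
        "throughput_rps": len(samples) / duration_s,
    }


# Hypothetical run: 19 fast successful requests and one slow failure over 10 seconds.
samples = [(100 + i, True) for i in range(19)] + [(400, False)]
baseline = summarize(samples, duration_s=10)
print(baseline)
```

Writing the summary down, for example by committing it next to the test script, is what makes next week's run comparable to this one.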

Second, set thresholds. Take your observed values and add headroom for normal variation. If you measured a ninety-fifth percentile of 320 milliseconds, a threshold of 500 milliseconds leaves room for jitter while still failing the test on a genuine regression. Encode these thresholds in your script so the test produces a pass or fail verdict, not just metrics.
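One way to arrive at a number like 500 from an observed 320 is sketched below. The 50 percent headroom and the rounding up to the next 50 milliseconds are assumptions chosen so the arithmetic matches the example; the right headroom for a given system depends on how much its p95 varies from run to run.

```python
import math


def threshold_with_headroom(observed_ms, headroom=0.5, round_to_ms=50):
    """Observed value plus fractional headroom, rounded up to a tidy limit.

    headroom and round_to_ms are illustrative defaults, not recommended values.
    """
    raw = observed_ms * (1 + headroom)
    return math.ceil(raw / round_to_ms) * round_to_ms


def passes(p95_ms, limit_ms):
    """Pass/fail verdict for a single run against the limit."""
    return p95_ms < limit_ms


limit = threshold_with_headroom(320)  # 320 * 1.5 = 480, rounded up to 500
print(limit)
print(passes(350, limit))  # normal jitter stays under the limit
print(passes(520, limit))  # a large regression breaches it
```

A slowdown that stays under the limit, say from 320 to 480 milliseconds, would still pass. That is the cost of headroom, and a reason to keep it no wider than the observed jitter requires.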

Third, validate. Re-run the test with thresholds enabled. If it passes consistently across multiple runs, you have a working baseline. If it’s flaky, either your thresholds are too tight or your system has genuine variability you need to investigate.
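The validate step can be pictured as a check over several runs rather than one. The sketch below uses hypothetical p95 values from five repeated runs; the labels "stable", "flaky", and "failing" are illustrative names, not terms from any particular tool.

```python
def validate(run_p95s_ms, limit_ms):
    """Classify a threshold by how repeated runs fare against it."""
    if not run_p95s_ms:
        raise ValueError("need at least one run")
    verdicts = [p95 < limit_ms for p95 in run_p95s_ms]
    if all(verdicts):
        return "stable"   # every run passed: usable as a gate
    if not any(verdicts):
        return "failing"  # every run failed: limit is below what the system delivers
    return "flaky"        # mixed results: limit too tight, or real variability


# Hypothetical p95 values (ms) from five back-to-back runs of the same build.
runs = [318, 331, 325, 342, 329]
print(validate(runs, limit_ms=500))  # comfortable headroom
print(validate(runs, limit_ms=330))  # limit sits inside the normal jitter band
```

A "flaky" result with a limit this close to the observed values points to the threshold; a "flaky" result despite generous headroom points to variability in the system that is worth investigating.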

Once validated, the baseline becomes a repeatable quality gate. It runs in continuous integration and deployment, catches regressions automatically, and gives the team a shared definition of acceptable performance. The key shift is from “someone looks at the results” to “the test decides.”
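"The test decides" usually comes down to an exit code that the pipeline can act on. The sketch below is a minimal, tool-agnostic illustration; the metric names and limits are hypothetical, and many load-testing tools provide this behavior natively through their own threshold configuration.

```python
def gate_exit_code(summary, limits):
    """Return 0 if every metric is under its limit, 1 otherwise.

    summary and limits are dicts keyed by metric name (hypothetical names).
    All limits are treated as upper bounds.
    """
    breaches = [name for name, limit in limits.items() if summary[name] >= limit]
    for name in breaches:
        print(f"FAIL {name}: observed {summary[name]}, limit {limits[name]}")
    return 1 if breaches else 0


limits = {"p95_ms": 500, "error_rate": 0.01}

healthy = {"p95_ms": 320, "error_rate": 0.002}
regressed = {"p95_ms": 640, "error_rate": 0.002}

print(gate_exit_code(healthy, limits))
print(gate_exit_code(regressed, limits))

# In a CI job, a wrapper script would end with:
#   sys.exit(gate_exit_code(summary, limits))
# so that a non-zero exit code fails the pipeline step.
```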