The baseline methodology

Scenario

A change passes functional tests and feels fine when you click through it on your own, but the p95 latency you care about was never written down. Under dozens of concurrent users, the same build might already be slow or erratic; you would not know until a release or an incident. A baseline turns “it felt OK” into numbers you can compare next week.

The baseline workflow

Step	What you do	What it produces
1. Observe	Run without thresholds at realistic load	Actual p95, error rate, throughput
2. Set	Add thresholds based on observed values + headroom	Script with pass/fail criteria
3. Validate	Re-run with thresholds, confirm consistency	Working quality gate

Why this order matters

Observe before you set thresholds: use measured values, because guessed limits drift and erode trust in the gate.
Validate before you wire CI: get the thresholds stable first, because a flaky gate wastes the whole team’s time.
Leave headroom on p95: a gap between what you saw and the limit absorbs normal jitter while still catching real regressions.

The shift

Before baselines	After baselines
“The results looked okay to me”	The test passed with exit code 0
Someone reviews metrics manually	CI/CD decides automatically
Regressions found in production	Regressions blocked at merge

Script

You have test results from Module 2. Numbers came back. But numbers alone don’t tell you whether performance is acceptable. You need a methodology that turns observed values into automated decisions.

That methodology is the baseline workflow, and it has three steps. First, observe. Run your test without thresholds and record what the system actually produces under realistic load. Don’t guess at what the ninety-fifth percentile should be. Measure it.
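As a rough illustration of the observe step, the sketch below reduces raw request timings to the three numbers the workflow records: p95, error rate, and throughput. The function names and the sample data are hypothetical, and the nearest-rank percentile used here is only one common definition; a load testing tool may interpolate differently and report slightly different values.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples recorded")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def summarize(durations_ms, failed_count, elapsed_s):
    """Reduce one test run to the baseline numbers (no thresholds applied)."""
    total = len(durations_ms)
    return {
        "p95_ms": percentile(durations_ms, 95),
        "error_rate": failed_count / total,
        "throughput_rps": total / elapsed_s,
    }

# Hypothetical run: 100 requests over 20 seconds, one of them failed.
samples = [200 + i for i in range(100)]  # 200 ms .. 299 ms
observed = summarize(samples, failed_count=1, elapsed_s=20)
print(observed)
```

Writing these values down is the whole point of the step: they are the reference that the thresholds are derived from.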

Second, set thresholds. Take your observed values and add headroom for normal variation. If you measured a ninety-fifth percentile of 320 milliseconds, a threshold of 500 milliseconds leaves room for jitter while still catching genuine regressions. Encode these thresholds in your script so the test produces a pass or fail verdict, not just metrics.
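One way to arrive at a figure like 500 milliseconds from an observed 320 is sketched below. The 50% headroom and the rounding to the next 100 milliseconds are assumptions chosen for illustration, not a rule from this lesson; pick values that match the run-to-run variation you actually see.

```python
import math

def threshold_with_headroom(observed_ms, headroom=0.5, round_to_ms=100):
    """Pad an observed value by a headroom fraction, then round up to a tidy limit.

    headroom and round_to_ms are illustrative defaults (assumptions).
    """
    padded = observed_ms * (1 + headroom)
    return math.ceil(padded / round_to_ms) * round_to_ms

# 320 ms observed -> 480 ms with 50% headroom -> rounded up to 500 ms
limit_ms = threshold_with_headroom(320)
print(limit_ms)
```

Rounding up keeps the limit easy to read and remember, at the cost of slightly more headroom than the raw percentage gives.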

Third, validate. Re-run the test with thresholds enabled. If it passes consistently across multiple runs, you have a working baseline. If it’s flaky, either your thresholds are too tight or your system has genuine variability you need to investigate.
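The validation step can be sketched as a check over several runs against the same limit. The function name, the three verdict labels, and the sample p95 values are hypothetical; the "failing" case is an addition here to separate a threshold that never passes from one that passes only some of the time.

```python
def validate_baseline(p95_runs_ms, limit_ms):
    """Classify repeated runs against a single p95 threshold."""
    if not p95_runs_ms:
        raise ValueError("need at least one run to validate")
    breaches = [p95 for p95 in p95_runs_ms if p95 > limit_ms]
    if not breaches:
        return "stable"   # passes consistently: a working baseline
    if len(breaches) == len(p95_runs_ms):
        return "failing"  # never passes: limit too tight, or a real regression
    return "flaky"        # passes sometimes: tighten the test setup or investigate variability

# Hypothetical p95 values from five runs, checked against a 500 ms limit.
print(validate_baseline([310, 335, 328, 341, 322], 500))
print(validate_baseline([310, 520, 330], 500))
```

How many runs count as "consistent" is a judgment call; more runs give more confidence before the gate goes into CI.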

Once validated, the baseline becomes a repeatable quality gate. It runs in your continuous integration and deployment pipeline, catches regressions automatically, and gives the team a shared definition of acceptable performance. The key shift is from “someone looks at the results” to “the test decides.”
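The idea of "the test decides" comes down to turning threshold breaches into a process exit code that a pipeline can act on. The sketch below is tool-agnostic and uses hypothetical metric names and limits; a load testing tool with built-in thresholds would normally do this for you.

```python
def find_breaches(metrics, thresholds):
    """Return a description of each metric that exceeds its limit (empty list means pass)."""
    return [
        f"{name}: {metrics[name]} > {limit}"
        for name, limit in thresholds.items()
        if metrics[name] > limit
    ]

def run_gate(metrics, thresholds):
    """Return 0 when every threshold holds, 1 otherwise, mirroring a CI exit code."""
    breaches = find_breaches(metrics, thresholds)
    for breach in breaches:
        print("FAIL", breach)
    return 1 if breaches else 0

# Hypothetical limits and results.
thresholds = {"p95_ms": 500, "error_rate": 0.01}
passing_code = run_gate({"p95_ms": 320, "error_rate": 0.0}, thresholds)
failing_code = run_gate({"p95_ms": 640, "error_rate": 0.0}, thresholds)

# In a CI job, the result would be handed to the shell, for example:
#   sys.exit(run_gate(metrics, thresholds))
```

A non-zero exit code is what lets the pipeline block a merge without anyone reading the metrics by hand.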
