Honest DORA: Define a Deployment Before You Measure

This is part eight of Ship the Proof. The series has spent seven parts making a pipeline that produces verifiable artifacts and defends them. This part turns from building the pipeline to measuring it, and the measurement is only honest if you start with a definition almost everyone skips.

Two teams report deployment frequency. One counts every merge to the main branch. The other counts only the moment a verified artifact reaches a healthy running state in production. Their numbers are not comparable, and neither is wrong. They are answering different questions while using the same word. This is the quiet flaw in most DORA dashboards: every metric is built on “a deployment,” and almost nobody defines what that is.

Here is the uncomfortable detail. DORA’s own Four Keys guide enumerates the metrics but does not give a standalone definition of what a deployment is (dora.dev). That is not an oversight you can route around. It means the definition is an operational choice you have to make explicitly, and if you skip it, your metrics inherit whatever implicit definition your tooling happened to encode.

Pin the definition before you measure anything

A workable definition for a pipeline that promotes container images: a deployment is a production promotion of a content digest that reaches a healthy running state. Both halves carry weight.

“Promotion of a content digest” means the event being counted is the specific moment a verified digest is admitted to production. Not a merge. Not a build. Not a manifest push. The exact, attested artifact that runs, which the first part of this series argued is the only honest unit of release. This keeps the metric anchored to the thing that actually serves traffic rather than to an intention upstream of it.

“Reaches a healthy running state” means a promotion that is admitted but never becomes healthy is not a completed deployment. Healthiness is observable from your deployment controller; a GitOps controller that “continuously monitors running applications and compares the current, live state against the desired target state” supplies exactly that signal (Argo CD). The deployment is counted when the promoted digest is synced and healthy, not when its pods are scheduled. The controller is interchangeable; the definition is what matters. With it in place, deployment frequency, lead time, and recovery time all count the same well-defined event instead of three subtly different ones.

There are five metrics, not four, and two of them pull against the others

The “Four Keys” name is sticky, but the current model has five metrics, grouped into two families that pull against each other (dora.dev). Throughput covers change lead time, deployment frequency, and failed-deployment recovery time. Instability covers change fail rate and deployment rework rate.

The two families are in tension by design. Going faster tends to stress stability. That tension is the entire reason you measure both. Report a single composite “DORA score” and you have averaged away the one thing the metrics exist to expose.
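A report shape that keeps the families apart might look like the sketch below. Everything here is an assumption for illustration: the Deployment record, the use of medians, and the rule that sets the rework flag are choices you would make yourself, not definitions taken from DORA. The point is only the output structure: two families, five numbers, no composite.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """One completed deployment (hypothetical record shape)."""
    commit_at: datetime
    healthy_at: datetime
    failed_at: Optional[datetime] = None      # set if the deployment later failed
    recovered_at: Optional[datetime] = None   # set once service was restored
    rework: bool = False                      # flagged by whatever rework rule you adopt


def five_metrics(deployments, window_days):
    """Return the five metrics grouped by family, with no composite score."""
    if not deployments:
        raise ValueError("no deployments in the window")
    n = len(deployments)
    lead_times = [d.healthy_at - d.commit_at for d in deployments]
    failed = [d for d in deployments if d.failed_at is not None]
    recoveries = [d.recovered_at - d.failed_at for d in failed if d.recovered_at is not None]
    return {
        "throughput": {
            "change_lead_time": median(lead_times),
            "deployment_frequency_per_day": n / window_days,
            "failed_deployment_recovery_time": median(recoveries) if recoveries else None,
        },
        "instability": {
            "change_fail_rate": len(failed) / n,
            "deployment_rework_rate": sum(d.rework for d in deployments) / n,
        },
    }


def _t(day, hour):
    # Illustrative timestamps only.
    return datetime(2025, 1, day, hour)


sample = [
    Deployment(_t(6, 9), _t(6, 10)),
    Deployment(_t(6, 12), _t(6, 14), failed_at=_t(6, 15), recovered_at=_t(6, 16)),
    Deployment(_t(7, 9), _t(7, 12), rework=True),
    Deployment(_t(7, 13), _t(7, 17)),
]

report = five_metrics(sample, window_days=2)
print(report["throughput"]["change_lead_time"])   # 2:30:00
print(report["instability"]["change_fail_rate"])  # 0.25
```

There is deliberately no key that combines the two families. Anything downstream that wants a single number has to compute it on purpose.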

A note on what I am deliberately not doing: I am not quoting performance-tier bands or a target change-fail-rate percentage. Set every threshold from your own service’s baseline. The five-metric structure is the standard. The benchmark numbers that float around are not something to import wholesale.

Decompose lead time so it tells you where to look

“Change lead time is 18 hours” is a number you cannot act on. The actionable version is a breakdown: how much was build, how much was waiting for review, how much sat in an approval gate.

End-to-end pipeline timing can ride on a standardized convention where one exists. The OpenTelemetry CI/CD semantic conventions define a cicd.pipeline.run.duration metric and result attributes (OpenTelemetry CI/CD metrics). Two honesty caveats: those conventions are at development status, not yet stable, and they define no deployment metric, so DORA keys have to be derived from the pipeline signals plus deployment metadata you attach yourself.

For the gates that are not pipeline tasks (the change record, approvals), add a per-gate duration metric of your own, tagged with the gate name and the digest. The validation that it is wired correctly: for any one digest, the pipeline run duration plus the sum of the per-gate durations should reconcile, within a tolerance you choose, against the commit-to-healthy-deploy lead time. That holds only if the gates and the pipeline run do not overlap in time; where they do, count the overlap once. When the numbers reconcile, you can point at the slow gate instead of guessing.
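The reconciliation check can be sketched in a few lines. The gate names, the five-minute tolerance, and the breakdown of the 18-hour figure are all made up for illustration, and the sketch assumes the gates and the pipeline run do not overlap in time.

```python
from datetime import timedelta


def reconcile(lead_time, pipeline_run, gate_durations, tolerance=timedelta(minutes=5)):
    """Compare commit-to-healthy lead time with the time the stages account for.

    gate_durations maps a gate name to its duration for one digest.
    """
    accounted = pipeline_run + sum(gate_durations.values(), timedelta())
    gap = lead_time - accounted
    slowest = max(gate_durations, key=gate_durations.get) if gate_durations else None
    return {
        "accounted": accounted,
        "gap": gap,  # positive: time nobody is measuring yet
        "reconciles": abs(gap) <= tolerance,
        "slowest_gate": slowest,
    }


# Hypothetical breakdown of an 18-hour lead time for a single digest.
result = reconcile(
    lead_time=timedelta(hours=18),
    pipeline_run=timedelta(minutes=40),
    gate_durations={
        "review_wait": timedelta(hours=11),
        "change_record": timedelta(hours=1, minutes=20),
        "approval": timedelta(hours=5),
    },
)
print(result["reconciles"], result["slowest_gate"])  # True review_wait
```

A large positive gap usually means a gate exists that is not instrumented yet. A negative gap means stages overlap or a duration is double counted.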

Segment by AI cohort, or the signal cancels out

If AI is writing or assisting a meaningful share of your changes, a single blended DORA number is now actively misleading, and DORA’s own 2025 research says why.

The 2025 report finds AI pulling throughput and stability in opposite directions. It reports “a positive relationship between AI adoption on both software delivery throughput and product performance,” while “AI adoption does continue to have a negative relationship with software delivery stability” (2025 DORA report). It frames AI as “the great amplifier,” magnifying an organization’s existing strengths and weaknesses.

Read that carefully, because the wording is doing honest work. DORA reports a relationship, not a proven cause. The program rests on cross-sectional, self-reported survey data, which supports correlation and prediction rather than causation, and the AI findings in particular have drawn methodology critiques. Treat the direction of the effect as a well-grounded signal and the precise magnitude as uncertain.

Now look at what a blended metric does to even that hedged signal. If AI pushes throughput up and stability down, averaging the AI-assisted and human-only changes together blurs both effects. Your deployment frequency ticks up, your change fail rate ticks up, and the dashboard cannot tell you which changes are driving either move. Segment by cohort instead: record per change whether and how AI assisted, then report the five metrics separately for the AI-assisted cohort and the rest. Now you can watch throughput and change fail rate side by side within the AI cohort, and decide whether your testing and your admission gates are holding the line. That is the difference between steering the amplifier and merely watching it.
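A minimal sketch of the segmentation, assuming each change already carries a trustworthy ai_assisted label. The Change record is hypothetical, and the sample data is invented purely to show the arithmetic; it does not represent any measured effect. Only two of the five metrics are computed here to keep the example short.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One deployed change (hypothetical record shape)."""
    digest: str
    ai_assisted: bool
    failed: bool


def summarize(changes):
    """Deployment count and change fail rate for one group of changes."""
    n = len(changes)
    return {
        "deployments": n,
        "change_fail_rate": sum(c.failed for c in changes) / n if n else None,
    }


def by_cohort(changes):
    """Report the same summary separately for each cohort."""
    cohorts = defaultdict(list)
    for c in changes:
        cohorts["ai_assisted" if c.ai_assisted else "other"].append(c)
    return {name: summarize(group) for name, group in cohorts.items()}


# Invented data: 8 AI-assisted changes (3 failed), 8 others (1 failed).
changes = (
    [Change(f"sha256:a{i}", True, failed=i < 3) for i in range(8)]
    + [Change(f"sha256:h{i}", False, failed=i < 1) for i in range(8)]
)

blended = summarize(changes)
segmented = by_cohort(changes)
print(blended["change_fail_rate"])                   # 0.25
print(segmented["ai_assisted"]["change_fail_rate"])  # 0.375
print(segmented["other"]["change_fail_rate"])        # 0.125
```

In this invented sample the blended rate of 0.25 describes neither cohort. The segmented view is the one that tells you which gates to look at.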

For the cohort label to be trustworthy, it has to be recorded as fact at authoring time, not reverse-engineered after the deployment. The next part is about exactly that: recording AI authorship in a way you can later trust, in the commit and in the signed provenance.

Sources