Observability for a Delivery Pipeline, Not Just the App

You can see every span your application emits in production and still be blind to your delivery pipeline. When a deploy stalls, when an attestation fails to verify, when a restore drill quietly stops running, there is often no signal at all — because nobody instrumented the pipeline the way they instrumented the app. The pipeline is software too. It deserves a signal catalog.

A signal catalog answers four questions for every signal worth emitting: what is it, which stage produces it, who consumes it, and what — if anything — pages someone. Writing that down forces a useful discipline. You discover the signals you assumed existed but do not, and you stop alerting on things no human acts on.
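One way to make the four questions concrete is to keep the catalog as data. The sketch below is illustrative only: the `CatalogEntry` type, its field names, and the `paging_signals` helper are assumptions introduced here, not part of any standard or tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CatalogEntry:
    """One catalog row. Field names are hypothetical, chosen to mirror the four questions."""
    signal: str            # what is it
    stage: str             # which stage produces it
    consumer: str          # who consumes it
    alert: Optional[str]   # what pages someone; None means nothing pages


def paging_signals(catalog: List[CatalogEntry]) -> List[str]:
    """Return the names of signals that carry an alert condition."""
    return [entry.signal for entry in catalog if entry.alert is not None]


# A few example rows; a real catalog would list every signal worth emitting.
CATALOG = [
    CatalogEntry("cicd.pipeline.run.duration", "build", "dashboard",
                 "sustained regression vs baseline"),
    CatalogEntry("image_digest", "build", "audit", None),
    CatalogEntry("restore_drill_execution", "maintenance", "audit / dashboard",
                 "drill overdue or last drill failed"),
]

print(paging_signals(CATALOG))
```

Keeping the alert field optional makes "nothing pages" an explicit answer rather than an omission.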

Lean on a standard where one exists, and be honest where it does not

Most pipeline signals are application-defined: you emit them, no spec standardizes them. That is fine, as long as you do not dress them up as standards. There is exactly one standardized slice here, and it is worth using.

The OpenTelemetry CI/CD semantic conventions define a cicd.* namespace. cicd.pipeline.run.duration is “Duration of a pipeline run grouped by pipeline, state and result,” cicd.pipeline.run.errors counts “the number of errors encountered in pipeline runs,” and span attributes carry the outcome: cicd.pipeline.name, cicd.pipeline.result, cicd.pipeline.task.run.result (OpenTelemetry CI/CD metrics, spans). Two caveats, in the interest of honesty. First, the conventions are at Development status, not yet stable, so expect churn. Second, there is no OTel deployment.* metric: your DORA keys derive from the cicd.* pipeline signals plus deployment metadata you attach yourself.
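As a rough sketch of that derivation, the function below computes two DORA-style figures from pipeline run records. The `cicd.pipeline.name` and `cicd.pipeline.result` keys come from the conventions named above. The `environment` and `caused_incident` fields are hypothetical deployment metadata you would attach yourself, and the result values `"success"` and `"failure"` are assumed here for illustration.

```python
from typing import Dict, List, Optional


def dora_summary(runs: List[Dict], environment: str = "prod") -> Dict[str, Optional[float]]:
    """Count successful deploys to one environment and the share that caused an incident."""
    deploys = [
        run for run in runs
        if run.get("environment") == environment          # hypothetical metadata field
        and run.get("cicd.pipeline.result") == "success"  # assumed result value
    ]
    failed = [run for run in deploys if run.get("caused_incident")]  # hypothetical field
    rate = len(failed) / len(deploys) if deploys else None
    return {"deployments": len(deploys), "change_failure_rate": rate}


runs = [
    {"cicd.pipeline.name": "deploy", "cicd.pipeline.result": "success",
     "environment": "prod", "caused_incident": False},
    {"cicd.pipeline.name": "deploy", "cicd.pipeline.result": "success",
     "environment": "prod", "caused_incident": True},
    {"cicd.pipeline.name": "deploy", "cicd.pipeline.result": "failure",
     "environment": "prod", "caused_incident": False},
    {"cicd.pipeline.name": "deploy", "cicd.pipeline.result": "success",
     "environment": "staging", "caused_incident": False},
]

print(dora_summary(runs))
```

The pipeline signal supplies the outcome. Which environment a run targeted and whether it led to an incident are facts the conventions do not carry, which is why they appear as separate fields.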

Everything else in the catalog is application-defined but grounded in the tool or standard that motivates it. Where no industry SLO exists, the catalog says so rather than inventing a threshold. “Set per service” is a real answer; a fabricated number is not.

Each row maps a signal to the stage that emits it, the consumer who reads it, and the alert (if any). Stages are the human-readable phases of the pipeline: build, SBOM generation, vulnerability scan, provenance and signing, promotion, admission, rollout, recovery, and the steady-state maintenance loop.

| Signal | Stage | Consumer | Alert / SLO | Grounded in |
| --- | --- | --- | --- | --- |
| cicd.pipeline.run.duration | build | dashboard | alert on sustained regression vs baseline; no standard SLO | OTel CI/CD |
| image_digest (sha256:) | build | audit | none: it is identity, not a measurement | OCI dist-spec |
| sbom_generated_event (CycloneDX 1.7, attached as referrer) | SBOM generation | audit / log | alert if SBOM missing for a built digest | Syft |
| vuln_scan_result (+ attestation) | vulnerability scan | dashboard / gate | block on findings above the severity gate (gate is a policy choice) | Grype |
| slsa_provenance_attested | provenance and signing | admission input / audit | alert if provenance missing for a promotable digest | SLSA levels |
| test_result_attested (in-toto predicate) | provenance and signing | admission input / audit | alert if attestation absent for a promotable digest | in-toto test-result |
| keyless_sign_event (Fulcio cert, Rekor v2 entry, cosign ≥ 2.6.0) | provenance and signing | admission input / audit | alert on signing failure for a promotable digest | cosign signing |
| attestation_verification_result (fail-closed verify) | promotion | gate / audit | block on verify failure or orphaned referrers | cosign verify |
| argocd_sync_status / argocd_health_status | rollout | dashboard / DORA | alert on OutOfSync for prod or Degraded health | Argo CD |
| admission_verification_result (deny on missing/invalid attestations) | admission | gate / audit | block on admission deny | Kyverno |
| rollout_abort_event (runtime fallback to prior stable) | rollout | alert / human | alert on abort: it is an incident signal | Argo Rollouts |
| gitops_revert_event (durable revert to prior verified digest) | recovery | DORA (recovery time) / audit | none: recovery action; time tracked in DORA | Argo CD |
| restore_drill_execution | maintenance | audit / dashboard | alert if drill overdue or last drill failed (cadence is local policy) | SOC 2 trust criteria |
| sbom_rescan_result (SBOM-first rescan as vuln DB updates) | maintenance | dashboard / alert | alert when rescan surfaces findings above the severity gate | Grype |
| supply_chain_tool_version / tool_drift | maintenance | dashboard / alert | alert on tool below minimum, a non-SHA-pinned action, or a Rekor shard rotation needing re-anchor | GHSA-69fq-xp46-6x23 |

That is not exhaustive, but it covers the shape. Notice how few rows carry a numeric SLO. Most of these are pass/fail gates or hygiene signals where “passing” is the only target, and the few rate metrics genuinely have no industry-standard threshold.

Most of the catalog is mechanical. Three signals repay extra attention because they catch failures the rest of your observability will not.

Gate duration. End-to-end pipeline timing rides on cicd.pipeline.run.duration, but that lumps together build, review, and approval. Add a per-gate duration metric of your own (gate_duration_seconds, tagged with the gate name and the digest) and you can decompose lead time across the gates and point at the slow one. To validate the wiring, check that the per-gate durations plus the pipeline run duration reconcile against the commit-to-healthy-deploy lead time for the same digest. Without the per-gate breakdown, “lead time is high” is a number you cannot act on.
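The reconciliation check can be sketched as a small function. Everything here is an assumption for illustration: the function name, the gate names, and especially the 5% tolerance, which is a placeholder rather than a recommended threshold.

```python
from typing import Dict, Optional


def reconcile_lead_time(
    gate_durations: Dict[str, float],   # gate_duration_seconds per gate, for one digest
    pipeline_run_seconds: float,        # cicd.pipeline.run.duration for the same digest
    lead_time_seconds: float,           # commit-to-healthy-deploy lead time
    tolerance: float = 0.05,            # placeholder: fraction of lead time allowed as a gap
) -> Dict[str, object]:
    """Compare the time the signals account for against the measured lead time."""
    accounted = sum(gate_durations.values()) + pipeline_run_seconds
    unexplained = lead_time_seconds - accounted
    slowest: Optional[str] = (
        max(gate_durations, key=gate_durations.get) if gate_durations else None
    )
    return {
        "accounted_seconds": accounted,
        "unexplained_seconds": unexplained,
        "reconciles": abs(unexplained) <= tolerance * lead_time_seconds,
        "slowest_gate": slowest,
    }


# Hypothetical gate names and timings for a single digest.
report = reconcile_lead_time(
    gate_durations={"promotion": 3600, "admission": 30, "rollout": 1800},
    pipeline_run_seconds=600,
    lead_time_seconds=6100,
)
print(report)
```

A large unexplained gap suggests a gate that is not instrumented yet, while `slowest_gate` names the place to look first when lead time climbs.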

Attestation-verification result. This is the signal that most teams skip and most regret. When you promote an image between registries, the attestations attached to it are separate OCI referrer manifests, not image layers — and a naive copy by digest leaves them behind. The fail-closed verify at the next boundary then correctly rejects an image whose provenance has vanished. If you do not emit and alert on the verification result, the first you hear of an orphaned-referrer problem is a production admission denial during a deploy. Emit it at every promotion boundary, and alert on both verify failure and orphaned referrers, so you catch the gap at promotion rather than at the production gate.
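A minimal sketch of the signal emitted at a promotion boundary follows. It assumes you have already listed the referrers attached to the digest in the source and destination registries and run your verifier. The function names, the referrer labels, and the shape of the emitted record are all hypothetical; no registry or cosign call is made here.

```python
from typing import Dict, Iterable, List


def orphaned_referrers(source: Iterable[str], destination: Iterable[str]) -> List[str]:
    """Referrers present at the source but left behind by the copy."""
    return sorted(set(source) - set(destination))


def verification_result(
    digest: str,
    source_referrers: Iterable[str],
    destination_referrers: Iterable[str],
    verify_ok: bool,
) -> Dict[str, object]:
    """Build the attestation_verification_result record for one promotion."""
    missing = orphaned_referrers(source_referrers, destination_referrers)
    return {
        "signal": "attestation_verification_result",
        "digest": digest,
        # Fail closed: a passing verify does not count if referrers were orphaned.
        "verified": verify_ok and not missing,
        "orphaned_referrers": missing,
        "alert": (not verify_ok) or bool(missing),
    }


# Hypothetical labels; a naive copy by digest carried only the signature across.
result = verification_result(
    digest="sha256:example",
    source_referrers=["sbom", "provenance", "signature"],
    destination_referrers=["signature"],
    verify_ok=True,
)
print(result)
```

Reporting the orphaned referrers separately from the verify outcome lets the alert say which attestation went missing, not only that promotion was blocked.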

Restore-drill execution. Recovery capability decays silently. A backup job that has been failing for three weeks looks identical to one that never runs, until the day you need it. Emit a signal each time the restore drill executes and records a result, and alert when the drill is overdue or the last run failed. The cadence is a local policy choice — there is no standard frequency to quote — but the alert on “overdue or failed” is what turns an untested disaster-recovery plan into a tested one. This is also the signal an auditor asks for, because it is the evidence that recovery was exercised rather than merely documented.
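The "overdue or failed" rule is small enough to sketch directly. The function name and return values are assumptions, and the 30-day cadence in the example is a placeholder for whatever your local policy sets, not a recommended frequency.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def drill_alert(
    last_run_at: Optional[datetime],
    last_run_succeeded: bool,
    max_age: timedelta,
    now: datetime,
) -> Optional[str]:
    """Return "overdue", "failed", or None when the drill is current and passed."""
    if last_run_at is None:
        return "overdue"   # a drill that never ran is treated as overdue
    if now - last_run_at > max_age:
        return "overdue"
    if not last_run_succeeded:
        return "failed"
    return None


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
cadence = timedelta(days=30)  # placeholder cadence; set per local policy

print(drill_alert(datetime(2025, 5, 20, tzinfo=timezone.utc), True, cadence, now))
print(drill_alert(datetime(2025, 4, 1, tzinfo=timezone.utc), True, cadence, now))
print(drill_alert(datetime(2025, 5, 20, tzinfo=timezone.utc), False, cadence, now))
```

Treating "never ran" the same as "overdue" is the point of the signal: the absence of a drill record should page just as a stale one does.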

The catalog’s most useful column is the alert column, because it forces the question: would a human act on this page? Build duration regressing against baseline, yes. An SBOM missing for a built digest, yes. A signing failure, an admission deny, a rollout abort, a restore drill that went overdue — all yes, because each represents a broken invariant someone has to fix.

The DORA rate metrics, by contrast, are trend signals, not pages. So are branch age and the presence of an AI-provenance trailer. Paging on a slow-moving ratio trains people to ignore the pager. Put those on a dashboard, review them on a cadence, and reserve the alert for the gates and invariants that have actually broken. An observability layer for your pipeline is only worth building if the alerts it produces are ones people act on — everything else is a dashboard, and that is fine. The point is to know which is which before you wire the page.
