Automated LLM CI/CD Pipelines With Instant Rollback

Notes from the Release Cycle — Part V

The most-quoted page that didn’t go out last quarter was the one our observer fired on its own at 2:14 AM. The candidate release had passed the gate, dwelled at 5% for the required four minutes, advanced to 25%, and then sat there. The per-minute quality monitor saw three consecutive sub-threshold readings on the legal-domain slice, halted the rollout, and re-pointed routing to the previous release. By the time the on-call engineer opened the notification (sent for the receipt, not for an outage), production traffic had been back on the known-good release for nine minutes.

Nobody needed to do anything. The architecture from the first post in this series describes what the four stages are. This post is about what runs between human approvals — the automation layer under the architecture, the boundary where the pipeline either does the right thing on its own or it doesn’t.

The headline claim: most pipeline decisions should be automated, but not all of them. The boundary matters. The pipeline that automates everything will eventually promote a release a human should have caught; the pipeline that automates nothing has no purpose. Drawing the boundary right is what this post is about.

The automation spectrum

Every pipeline decision sits somewhere on a spectrum from “fires on its own with no human notification” to “refuses to proceed without an explicit signed approval.” Below is where each of the load-bearing pipeline actions sits on that spectrum in our shipped pipeline.

Two of the markers above are red rather than colored by their zone. Those are the asymmetric decisions — the two places where the pipeline takes a strong position about who gets to call what. The auto-rollback trigger fires without asking; you cannot configure it off, because the entire point of having it is that it works at 2:14 AM. The gate-fail override refuses to advance without a written rationale; you cannot configure that off either, because the entire point of having it is that future-you needs to read the reason. Most of the rest of the pipeline is configurable; these two are not.

How auto-rollback actually fires

The most-asked question about auto-rollback is “what stops it from firing for the wrong reason?” The honest answer is: nothing single-handedly. The protection comes from how the trigger is wired.

The Observe stage runs a per-minute scoring loop. Every minute it:

Samples a small set of recent production traces from the active release.
Replays each trace through the active model (not the candidate — we’re scoring what’s actually serving).
Scores each replay using the same calibrated human-anchored judge that drove Gate-2^[1].
Computes a single output-quality score over the sample. Writes it to CanaryHealthSample.
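The four steps above can be sketched as one function. This is a minimal sketch, not the shipped observer: `sample_traces`, `replay`, and `judge` are hypothetical callables standing in for the trace store, the active model, and the calibrated judge, and taking a plain mean as the aggregate is an assumption (the post does not say how per-trace scores are combined).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class CanaryHealthSample:
    """One per-minute reading; the field names are illustrative."""
    minute_at: datetime
    quality_score: float


def score_minute(
    sample_traces: Callable[[], Sequence[str]],  # hypothetical: recent production traces
    replay: Callable[[str], str],                # hypothetical: the active model, not the candidate
    judge: Callable[[str, str], float],          # hypothetical: calibrated judge, score in [0, 1]
    now: datetime,
) -> CanaryHealthSample:
    traces = sample_traces()
    if not traces:
        raise ValueError("no traces sampled this minute")
    # Replay each trace through the active model and score the replay.
    scores = [judge(trace, replay(trace)) for trace in traces]
    # Assumption: the single output-quality score is the mean over the sample.
    return CanaryHealthSample(
        minute_at=now.replace(second=0, microsecond=0),
        quality_score=mean(scores),
    )
```

Passing the three dependencies in as callables keeps the loop testable without a model or a judge behind it.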

The rollback fires when three consecutive per-minute samples fall below the rollback threshold (default: 0.85 of the gate threshold, so about 0.55 if the gate was 0.65). Not one bad minute; three. The three-minute lockout is the noise filter: a single anomalous reading does not trigger anything, but a sustained regression does.
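The trigger rule is small enough to write down. The sketch below assumes the readings arrive as a plain sequence of floats in time order; the function and constant names are illustrative, not the pipeline's own. Note that 0.85 × 0.65 is 0.5525, which rounds to the 0.55 quoted above.

```python
from typing import Iterable

ROLLBACK_FRACTION = 0.85   # default from the post: 0.85 of the gate threshold
CONSECUTIVE_STRIKES = 3    # three consecutive sub-threshold minutes


def rollback_threshold(gate_threshold: float,
                       fraction: float = ROLLBACK_FRACTION) -> float:
    return gate_threshold * fraction


def should_roll_back(scores: Iterable[float], threshold: float,
                     strikes: int = CONSECUTIVE_STRIKES) -> bool:
    """True once `strikes` consecutive readings fall below `threshold`."""
    run = 0
    for score in scores:
        # Any reading at or above the threshold resets the strike count.
        run = run + 1 if score < threshold else 0
        if run >= strikes:
            return True
    return False
```

With the readings from the receipt later in this post, `should_roll_back([0.523, 0.508, 0.491], 0.55)` is true; put one healthy minute in the middle of the run and it is false.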

When the lockout breaks, the rollback worker executes:

# In effect — the pipeline runs this on its own. No human ack.
POST /api/v1/releases/<previous_release_sha>/activate
# response in <1s; in-flight drain in ~12s on a ~100-replica service
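For readers who want to see the same call from code, here is a hedged sketch that builds the request without sending it. Only the path comes from the post; the host `pipeline.example` is a placeholder, and authentication is not shown because the post does not describe it.

```python
import urllib.request

API_BASE = "https://pipeline.example"  # hypothetical host; the post only gives the path


def build_activate_request(release_sha: str,
                           api_base: str = API_BASE) -> urllib.request.Request:
    """Build (but do not send) the POST that re-points routing at `release_sha`."""
    if not release_sha:
        raise ValueError("release_sha must be non-empty")
    url = f"{api_base}/api/v1/releases/{release_sha}/activate"
    # An empty body plus method="POST" makes the intent explicit.
    return urllib.request.Request(url, data=b"", method="POST")
```

Sending it would be `urllib.request.urlopen(req, timeout=5)`; the short timeout is a design choice here, since a rollback worker should fail fast and retry and should never hang.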

A receipt fires. The on-call engineer sees a Slack notification for the receipt, not for an outage. They open the receipt; they see the three sub-threshold readings, the elapsed time, and the vindex_sha256_before/after^[2] hashes. Twelve seconds is in-flight drain time; the swap itself is sub-second. By the time the engineer is awake enough to ask “do I need to do something?” the answer is “no, but you should still look at why the gate let this through.”

The actual auto-rollback receipt

This is what the receipt looks like in production. Same hash-chained format documented on the compliance page, with the additional fields specific to an auto-rollback event:

{
  "kind": "auto_rollback",
  "release_id": "rel_a01c66",
  "previous_release_id": "rel_8f72b1",
  "trigger_at": "2026-05-29T02:14:23Z",
  "completed_at": "2026-05-29T02:14:35Z",
  "elapsed_seconds": 12,
  "trigger_reason": "observer_quality_threshold_breach",
  "observer_readings": [
    { "minute_at": "2026-05-29T02:11:00Z", "quality_score": 0.523, "below_threshold": true,  "slice": "legal-IP-licensing" },
    { "minute_at": "2026-05-29T02:12:00Z", "quality_score": 0.508, "below_threshold": true,  "slice": "legal-IP-licensing" },
    { "minute_at": "2026-05-29T02:13:00Z", "quality_score": 0.491, "below_threshold": true,  "slice": "legal-IP-licensing" }
  ],
  "rollback_threshold": 0.55,
  "active_manifest_sha256_before": "9abaeaf6c91f8b...",
  "active_manifest_sha256_after":  "8f72b1de4a93c5...",
  "audit_chain_signature": "sha256(...)",
  "notified_users": ["oncall@customer.example"],
  "notification_sent_at": "2026-05-29T02:14:36Z"
}

The receipt itself is the on-call’s first point of contact. Reading it answers the questions a half-awake engineer would actually ask: what triggered it, which slice failed, by how much, how long the swap took, what’s running now. The next-action prompt from there is usually “go look at why the gate passed this in the first place”, and the receipt for the failing release already carries the per-slice Spearman table needed to answer that.
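Those questions can also be answered mechanically. The sketch below parses a trimmed copy of the receipt shown above and checks that it is internally consistent; `summarize_receipt` and the shape of its return value are illustrative, not part of any published tooling.

```python
import json
from datetime import datetime

RECEIPT_JSON = """
{
  "kind": "auto_rollback",
  "previous_release_id": "rel_8f72b1",
  "trigger_at": "2026-05-29T02:14:23Z",
  "completed_at": "2026-05-29T02:14:35Z",
  "elapsed_seconds": 12,
  "observer_readings": [
    {"quality_score": 0.523, "slice": "legal-IP-licensing"},
    {"quality_score": 0.508, "slice": "legal-IP-licensing"},
    {"quality_score": 0.491, "slice": "legal-IP-licensing"}
  ],
  "rollback_threshold": 0.55
}
"""


def parse_ts(value: str) -> datetime:
    # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_receipt(receipt: dict) -> dict:
    readings = receipt["observer_readings"]
    threshold = receipt["rollback_threshold"]
    elapsed = (parse_ts(receipt["completed_at"])
               - parse_ts(receipt["trigger_at"])).total_seconds()
    return {
        "slices": sorted({r["slice"] for r in readings}),
        "worst_score": min(r["quality_score"] for r in readings),
        "all_below_threshold": all(r["quality_score"] < threshold for r in readings),
        "elapsed_matches": elapsed == receipt["elapsed_seconds"],
        "now_serving": receipt["previous_release_id"],
    }


summary = summarize_receipt(json.loads(RECEIPT_JSON))
```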

What the pipeline does NOT do on its own

The corollary to “auto-rollback fires without asking” is that some other things cannot happen on the pipeline’s own authority. Three explicit refusals.

It does not promote a release that failed the gate without a signed override. A gate-fail marks the release gate_fail; the /activate endpoint refuses to accept the manifest SHA; no command-line incantation works around this. The only path forward is a force-override with forceGateOverride: true AND overrideReason: "<free text>". The reason field is required, free-text, and goes into the audit log alongside the user ID. We’ve designed this so future-you can read why current-you decided the slice regression was acceptable. Three people have used the override path in the wild. Their rationales are still in the audit log.
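The override rule reads naturally as a guard in front of activation. This is a sketch under assumptions: `forceGateOverride`, `overrideReason`, and the `gate_fail` status come from the post, while the `ActivationRequest` type, the exception, and the audit-entry shape are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


class ActivationRefused(Exception):
    """Raised when a gate-failed release is activated without a valid override."""


@dataclass(frozen=True)
class ActivationRequest:
    manifest_sha: str
    user_id: str
    forceGateOverride: bool = False        # field name as given in the post
    overrideReason: Optional[str] = None   # field name as given in the post


def authorize_activation(gate_status: str, req: ActivationRequest) -> dict:
    """Return an audit-log entry if activation may proceed; raise otherwise."""
    if gate_status != "gate_fail":
        return {"manifest_sha": req.manifest_sha, "user_id": req.user_id,
                "override": False}
    reason = (req.overrideReason or "").strip()
    # Both conditions are required: the flag alone, or a blank reason, is refused.
    if not req.forceGateOverride or not reason:
        raise ActivationRefused(
            "gate_fail release needs forceGateOverride and overrideReason")
    return {"manifest_sha": req.manifest_sha, "user_id": req.user_id,
            "override": True, "overrideReason": reason}
```

Stripping whitespace before the check is a deliberate choice in this sketch: a reason of `"   "` satisfies a naive non-empty test and tells future-you nothing.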

It does not advance from canary to 100% if any monitor is degrading. If p95 latency, 5xx rate, OR output-quality score is outside its band at the end of a checkpoint dwell, the pipeline halts at that checkpoint and pages. It does not advance and apologize later.
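A sketch of that end-of-dwell check, assuming each monitor has a simple low/high band. The monitor names and band values are made up for illustration, and treating a missing reading as a breach is this sketch's assumption, not a documented behavior.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Band:
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def checkpoint_decision(readings: Dict[str, float],
                        bands: Dict[str, Band]) -> Tuple[str, List[str]]:
    """Return ("advance", []) or ("halt_and_page", [out-of-band monitor names])."""
    # Assumption: a monitor with no reading counts as out of band,
    # so missing data can never advance a rollout.
    breached = [name for name, band in bands.items()
                if name not in readings or not band.contains(readings[name])]
    return ("halt_and_page", breached) if breached else ("advance", [])


# Illustrative bands only; real values would come from the release config.
EXAMPLE_BANDS = {
    "p95_latency_ms": Band(0.0, 800.0),
    "error_5xx_rate": Band(0.0, 0.01),
    "quality_score": Band(0.65, 1.0),
}
```

Because the check is an OR across monitors, one breached band is enough to halt; the returned list tells the page which monitor to look at first.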

It does not auto-canary a cold-start release. A release with no history of production traffic — fresh fine-tune against a brand-new dataset, say — has nothing to compare its output quality against. The pipeline refuses to start a canary on a cold-start release. We require a 24-hour shadow deployment first, which observes the candidate against real production traces but doesn’t serve any of its responses. After 24 hours we have a quality baseline; then the canary can proceed. Slower; honest; not configurable.
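The cold-start rule reduces to a single predicate. The sketch below assumes the pipeline knows whether a release has production history and when its shadow deployment started; both inputs and the function name are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

REQUIRED_SHADOW = timedelta(hours=24)  # from the post; not configurable


def can_start_canary(has_production_history: bool,
                     shadow_started_at: Optional[datetime],
                     now: datetime) -> bool:
    """A cold-start release may canary only after a full 24-hour shadow run."""
    if has_production_history:
        return True
    if shadow_started_at is None:
        # Cold start with no shadow deployment yet: refuse.
        return False
    return now - shadow_started_at >= REQUIRED_SHADOW
```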

How fast is the recovery, end-to-end?

The recovery time number we publish is 12 seconds. That’s in-flight drain on a ~100-replica service. The manifest swap itself is sub-second. To be useful to a reader, the 12 seconds needs to be decomposed:

The two to three minutes before rollback: the three consecutive sub-threshold readings arrive, one per minute. The first sub-threshold reading starts the lockout timer. Each subsequent minute extends the lockout if quality is still below threshold.
t = 0: the third sub-threshold reading writes to CanaryHealthSample. The rollback worker observes the third strike and dispatches /activate previous_release.
t < 1 second: the routing layer’s active-release pointer (in Redis) flips. New requests start hitting the previous release.
t = 1 to ~12 seconds: the candidate release continues serving any requests that were in flight when the swap happened. In-flight drain. Some streaming responses take 8–10 seconds to complete naturally, so the cleanup tail is around 12s on a typical service.
t ≈ 13 seconds: the audit log receipt is written and signed. The notification fires.

Compared against the public postmortems we keep citing as anchors: Cloudflare’s June 2022 outage^[3] took 44 minutes from “we know what to revert” to “the revert is complete”, and that was the infrastructure tier. Atlassian’s April 2022 outage^[4] took 12 hours per site because state was split across multiple systems. DORA’s^[5] “elite performer” threshold for failed-deployment recovery is under one hour. Twelve seconds isn’t one order of magnitude better than the elite threshold; it’s 300 times better, about two and a half orders of magnitude. The architectural decision that makes it possible is the bundled release manifest from Stage 1. Without the manifest, you don’t have a single object to re-point routing at.
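The ratios behind these comparisons are quick to check. The sketch below uses only the durations quoted in this post and treats the DORA threshold as exactly one hour, which is its upper bound.

```python
import math
from typing import Tuple

RECOVERY_S = 12  # the published recovery time, in seconds

COMPARISONS = {
    "DORA elite threshold (1 hour)": 60 * 60,
    "Cloudflare June 2022 revert (44 min)": 44 * 60,
    "Atlassian April 2022 restore (12 h per site)": 12 * 60 * 60,
}


def speedup(seconds: float, recovery_s: float = RECOVERY_S) -> Tuple[float, float]:
    """Return (ratio, orders of magnitude) of `seconds` relative to `recovery_s`."""
    ratio = seconds / recovery_s
    return ratio, math.log10(ratio)


for label, seconds in COMPARISONS.items():
    ratio, orders = speedup(seconds)
    print(f"{label}: {ratio:.0f}x, {orders:.2f} orders of magnitude")
```

One hour against twelve seconds is a factor of 300, a little under two and a half orders of magnitude; the Cloudflare figure works out to 220x and the Atlassian per-site figure to 3600x.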

Rollback drills — the unsexy practice nobody runs

Here is the part most teams skip: the only reliable signal that your rollback path works is that you ran a deliberate, scheduled drill and confirmed. Every quarter we run one. The drill goes:

Pick a randomly-scheduled, weekday-business-hours time. Tell the team it’s coming, but not the specific hour.
Inject a synthetic quality regression on the canary slice. (We have a test-mode flag that lets the candidate model respond to a magic header with “I refuse to answer” — guaranteed to fail the calibrated judge.)
Push the test release through the gate (it passes — we’re testing the rollback, not the gate). Start a canary.
Observer notices three sub-threshold readings. Auto-rollback fires.
Wait for the on-call engineer to react. Time how long they take. Note whether they trust the receipt enough not to page back in alarm.
Verify the audit log shows the test-mode flag in the rollback receipt, so future audits can distinguish a drill from a real incident.
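The last step is the easiest to automate. A minimal sketch, assuming the drill marker is a boolean field named `test_mode`; the post says the flag exists in the receipt but does not name it, so that field name is hypothetical.

```python
def classify_receipt(receipt: dict) -> str:
    """Label an auto-rollback receipt as "drill" or "incident".

    Assumption: the drill marker is a boolean `test_mode` field.
    """
    if receipt.get("kind") != "auto_rollback":
        raise ValueError("not an auto-rollback receipt")
    # Only an explicit True counts as a drill; a missing flag is a real incident.
    return "drill" if receipt.get("test_mode") is True else "incident"
```

Defaulting to "incident" when the flag is absent is the conservative choice: a drill mislabeled as an incident wastes an hour, while an incident mislabeled as a drill hides a real regression from the audit.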

The first drill we ran took 19 seconds end-to-end (the 12s drain plus a 7s settling delay we had to fix). The most recent drill, in Q1 2026, took 12 seconds. The drill never gets to be skipped. Every quarter; every customer cluster.

Most teams have never run a deliberate rollback drill. The first time their rollback path runs is during a real incident, under pressure, with multiple people in the call. The drill is what makes the 12-second number a real number rather than an aspirational one.

What this does not solve

Three honest limitations:

Auto-rollback can ping-pong. If both the candidate AND the previous release are bad — say, the previous release also had a slow-developing slice regression that nobody caught — the pipeline can roll back, then the previous release also fails its post-rollback observer, and there’s no third release to roll back to. The pipeline halts traffic to a maintenance page rather than thrashing. The fix is to keep more than one prior healthy release indexed in the manifest chain so the rollback target is configurable.
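Target selection with more than one prior release could look like the sketch below. It assumes the manifest chain can be read as a newest-first list of release IDs and that the observer keeps a set of releases known to be failing; both structures, and the third release ID in the usage note, are hypothetical.

```python
from typing import Optional, Sequence, Set

MAINTENANCE = None  # sentinel: no healthy release left, serve the maintenance page


def pick_rollback_target(chain: Sequence[str], failed: Set[str]) -> Optional[str]:
    """`chain` is ordered newest-first; return the newest release not known bad."""
    for release_id in chain:
        if release_id not in failed:
            return release_id
    # Every indexed release has failed: halt instead of thrashing between them.
    return MAINTENANCE
```

With `chain = ["rel_a01c66", "rel_8f72b1", "rel_older"]`, a failure of the first two falls through to `rel_older`; only when all three have failed does the function return the maintenance sentinel.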

The observer adds inference cost. Replaying production traces through the active model on a 5% sample adds roughly 5% to inference spend. We think this is the right trade. Some customers think it’s too expensive on low-margin workloads and want to dial the sample rate down. The dial exists.
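The cost trade is linear in the sample rate, which is what makes the dial useful. A back-of-the-envelope sketch, assuming a replay costs the same as the original request and ignoring the judge's own inference cost (the post does not break that out):

```python
def observer_overhead(baseline_spend: float, sample_rate: float = 0.05) -> float:
    """Extra spend from replaying `sample_rate` of traffic through the active model.

    Assumptions: a replay costs the same as the original request, and the
    judge's own inference cost is not counted.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return baseline_spend * sample_rate
```

Halving the sample rate halves the overhead, but it also halves the number of traces behind each per-minute score, which makes single readings noisier.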

A bad judge is worse than no judge. If the calibrated judge that drives the observer is itself miscalibrated — drifted from the human anchor, or trained on a stale corpus — the observer can fire auto-rollback for the wrong reason. The recalibration cadence matters. The Calibrating-the-Judge^[6] piece documents the procedure; the operational requirement is that you actually run it.

FAQ

Why is the rollback trigger three consecutive minutes rather than one?

Because LLM quality scores have a noise floor: a single anomalous minute reading can come from a sampling quirk (the 5% trace sample happened to land on a hard slice), not a real regression. The three-minute lockout is the cheapest noise filter that still keeps total reaction time to a few minutes (three per-minute readings cannot arrive in less than two). We’ve tuned both ways; three is the sweet spot for our customers’ typical traffic shape. The dwell is configurable per release if your traffic shape is different.
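A rough way to see why consecutive strikes filter noise: if a healthy release produces a spurious sub-threshold minute with probability p, and minutes are independent, then a given run of k minutes is all spurious with probability p^k. Independence is an assumption that real traffic rarely satisfies exactly, and the 5% figure in the usage note is illustrative, not measured.

```python
def false_trigger_probability(p_single: float, strikes: int = 3) -> float:
    """Probability that `strikes` given consecutive minutes are all spuriously low.

    Assumption: per-minute readings are independent, each spuriously
    sub-threshold with probability `p_single`.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if strikes < 1:
        raise ValueError("strikes must be at least 1")
    return p_single ** strikes
```

At an illustrative p of 0.05, one strike fires spuriously on 5% of minutes, while three consecutive strikes bring a given window down to 0.05³, or 0.0125%.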

Should auto-rollback be configurable to “off”?

In our shipped pipeline, no. The point of having an automated safety mechanism is that it works at 2:14 AM when nobody’s watching. A configurable-off auto-rollback is a sticky note that says “we used to have a safety net.” The argument for making it configurable is that some workloads are too low-stakes to justify any false-positive rollbacks. We think that argument leads to the wrong place — if your workload is too low-stakes for auto-rollback, you don’t need a release pipeline either.

How do you handle the case where the previous release was also bad?

The rollback target is previous_release by default, but the manifest chain stores more history than just N-1. Operators can re-target a rollback to any historically-healthy manifest SHA — /api/v1/releases/<historically_good_sha>/activate — which is the manual-intervention path when the automatic N-1 rollback hits a bad earlier release. The escape valve is there. It’s rare.

What’s the right metric to optimize — MTTR or MTBF?

MTTR — Mean Time To Recovery — by a wide margin, at least for LLM systems. MTBF (Mean Time Between Failures) assumes a deterministic notion of “failure” that LLM workloads don’t have. Output quality drifts continuously; “failure” is a threshold call. Optimizing for fast recovery is robust to where you draw the threshold; optimizing for never failing is brittle and false. DORA’s elite threshold^[5] is itself framed in MTTR terms, which is the right framing.

Do you actually run rollback drills?

Yes — quarterly, scheduled, with a test-mode flag in the receipt so the drill can be distinguished from a real incident in the audit log. The first drill we ran exposed a 7-second settling delay we hadn’t realized was there. The drill is the only way to know the path actually works; reading the runbook is not enough. Most teams have not run one, which is why most teams’ MTTR numbers are aspirational rather than measured.

References

[1] LLM-as-judge calibration. Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS 2023). The anchor for why a calibrated judge is necessary and why per-slice agreement matters more than aggregate agreement. The observer's per-minute scoring loop depends on this.
[2] vIndex weight-attestation. Documented on the Divinci compliance page and walked through in the regulated-fields post. The `vindex_sha256_before/after` fields in the auto-rollback receipt are the cryptographic anchor an auditor can verify without trusting our logs.
[3] Cloudflare June 2022 outage. Cloudflare outage on June 21, 2022. "06:58: Root cause found and understood. Work begins to revert the problematic change… 07:42: The last of the reverts has been completed." Forty-four minutes to revert at the infrastructure tier, in part because engineers walked over each other's reverts. Anchor for the "manifest-driven swap can't have that failure mode" claim.
[4] Atlassian April 2022 outage. Post-Incident Review: April 2022 Outage. 12 hours per site to restore, 14 days total for 883 sites, because state was split across independently-versioned systems. Anchor for the "bundled release manifest is the thing that makes seconds-not-hours possible" claim.
[5] DORA failed-deployment recovery threshold. DORA, Software delivery performance metrics. The "failed deployment recovery time" elite-performer threshold is documented as under one hour. The 12-second pipeline number is a factor of 300 (about two and a half orders of magnitude) below the elite threshold, which is the right way to read the comparison.
[6] Calibrating the AI judge. Our companion post Calibrating the AI Judge. The procedure for keeping the human-anchored judge in calibration over time. The operational claim in this post, that auto-rollback only works as well as the judge driving it, only holds if the judge is in fact periodically recalibrated.
[7] Internal: Divinci pipeline reference. The architecture this automation layer sits under: the four-stage pipeline post. The full API surface is documented at the API reference; the release-management section is the one this post is talking about.

Next in this series: CI Testing for Custom Language Models in 2026. This post is about the operational layer between human approvals. The next is the layer before the pipeline starts — pre-merge CI: what to evaluate at PR time, what kinds of regressions you actually catch before the gate sees them, and which kinds you don’t.
