10 CI/CD Release Failures in Custom LMs — Which Stage Catches Each

Notes from the Release Cycle — Part II

The first post in this series walked through the four-stage release pipeline we ship — Register → Gate → Roll → Observe. This post is the receipts: ten specific failure modes we’ve now caught with it, what each one looked like in practice, and which stage of the pipeline stopped it from reaching production.

The list is organized by stage, not severity, because the stage tells you where to invest if you’re building something like this yourself. If your gate is the weak link, six of the ten failures below will keep hitting you. If your observer is the weak link, two of them will hit you silently — meaning the only signal you’ll ever get is a customer complaint, which is the worst possible signal.

A pipeline that catches all ten isn’t a list of features. It’s a small number of architectural decisions made consistently. Each failure below names which decision applies.

How to read this list

Each failure is tagged with the stage that catches it:

① REGISTER — the manifest layer. Stops failures where you couldn’t tell which change broke production because state was split across systems.
② GATE — per-domain Spearman vs a calibrated human-anchored judge. Stops failures that hide inside aggregate scores.
③ ROLL — canary at 5% → 25% → 100% with a quality monitor at each checkpoint. Stops failures that only show up at scale.
④ OBSERVE — continuous trace replay through the candidate, scored by the gate’s judge. Stops silent quality drops that latency and 5xx never notice.

Each section ends with the fix — the exact configuration we ship at Divinci, plus what to build yourself if you’re not using us.

Stage ① — Register

1. Co-deploying model + prompt + routing in one bundle and not knowing which one broke it

What happened. We changed three things on the same release: bumped the base model from Gemma 4 E2B to Gemma 4 26B-A4B, edited the legal-domain system prompt to add a “cite the statute” instruction, and adjusted the routing rule that decides which traffic class hits which model. Accuracy on contract drafting dropped 7 points. None of the three changes had been independently tested. Debugging it required reverting one variable at a time over the course of two days.

Why the pipeline catches it now. A Divinci release is an immutable manifest bundling model_ref, prompt_template_ref, routing, and dataset_version into a single SHA-256-addressed artifact. The pipeline refuses to deploy a manifest that bundles more than one change unless the previous release’s SHA is referenced as the comparison baseline. If you want to ship three changes at once, you have to acknowledge it in the manifest, and the failure-attribution path stays clean because the next release is forced back to one-variable-at-a-time.

Fix. Don’t let humans assemble releases by hand. The release manifest should be generated by a pipeline that can’t bundle silently. See Stage 1 — Register for the API.

2. Editing a system prompt in a dashboard and shipping it without code review

What happened. Someone tweaked the system prompt in an admin UI to “make the model less verbose.” It looked like a one-word edit. The resulting prompt was 38 characters shorter, which dropped it below a length threshold the downstream prompt-rewriter used to decide whether to add safety boilerplate. Two hours later the model was answering questions it should have refused.

Why the pipeline catches it now. Prompts are part of the registered manifest. Editing one in a dashboard means cutting a new manifest, which means generating a new SHA, which means the gate runs against the change. You can still edit prompts in a dashboard. You just can’t ship them without the gate seeing them.

Fix. Treat prompts like code: version them with a content hash, register them as part of the release, gate them on the scored-QA suite. Tianpan’s Semver Lie writeup^[1] describes this exact failure mode happening in the wild — a prompt change that “passed code review, deployed without eval gates, hit production without per-user A/B, and triggered no automatic rollback.”

3. Training-serving preprocessing skew

What happened. The training pipeline normalized whitespace and lowercased a particular field. The serving pipeline didn’t. Same model, same prompt, same routing — different inputs at the byte level. On dev fixtures everything passed. On real traffic the model behaved as if it had been re-trained on noisier data, because from its perspective it had been.

Why the pipeline catches it now. The manifest registers a preprocessing_ref alongside model_ref. The gate evaluation runs through the same preprocessing the production serving stack uses. If the two diverge, the gate’s offline numbers no longer match production, and the per-slice Spearman drops in a way that’s measurable before promote.

Fix. Containerize preprocessing as a versioned artifact. Reference it from the manifest. Refuse to deploy if the gate was computed against a different preprocessing version than the one production will use.

Stage ② — Gate

The four failures below are the ones an aggregate-score gate would have shipped. The reason an aggregate gate misses them is structural, not parameter-tuning — averaging across slices destroys exactly the signal you’d use to catch a regression that’s localized to one slice.

4. The IP-licensing collapse (slice-aware regression #1)

What happened. A QLoRA fine-tune improved legal Q&A accuracy on five subdomains and crashed IP-licensing — contract drafting 0.71, statutory interpretation 0.74, case summarization 0.69, regulatory compliance 0.66, jurisdictional analysis 0.62, IP licensing 0.41. The aggregate Spearman ρ across all six was 0.64. The gate threshold was 0.65. By a single aggregate score, the release was a hair under the line. By the per-slice view, one subdomain had collapsed by 27 points.

Why the pipeline catches it now. The gate’s threshold is per-slice, not aggregate. Any single slice falling below its threshold marks the release gate_fail, regardless of how the average looks. The gate-thresholds chart in post #1 is the actual visualization the pipeline produces for releases like this one.

Fix. Slice the gate. The slices that matter are your customer-segment subdomains, not whatever taxonomy is in the eval framework you imported.

5. Pediatric oncology slice regression (slice-aware regression #2)

What happened. A medical Q&A model was fine-tuned on additional adult cardiology data. Aggregate medical accuracy improved 4 points. Pediatric oncology accuracy dropped 11 points — apparently the new training data subtly de-emphasized pediatric dosage adjustments. The aggregate gate would have promoted it.

Why the pipeline catches it now. Pediatric oncology was one of the slices configured by the customer when they registered the scored-QA suite. The Gate-2 evaluation produced a per-slice Spearman ρ that fell from 0.72 to 0.61, below the pediatric-oncology threshold of 0.68. Marked gate_fail. No deploy.

Fix. Customer-defined slices, not platform-defined ones. The platform should let the customer add a slice and a per-slice threshold without writing code — because nobody at Divinci knows your customer’s domain edges as well as your customer does.

6. Multilingual sub-language drift (slice-aware regression #3)

What happened. A multilingual model fine-tuned to improve French responses. Aggregate French accuracy improved 3 points. Within “French,” however, the model now performed worse on Belgian French and Swiss French regional variants — the training corpus had been Parisian-French-heavy. An aggregate French gate would have shipped it.

Why the pipeline catches it now. Locale variants are sub-slices of the language slice. The per-sub-slice Spearman caught the regression in the Belgian variant before promote. The release was returned for either (a) more diverse training data or (b) a force-override with a written justification (“we’re accepting the regional regression because the aggregate French improvement matters more in this rollout”) — and the override goes into the audit trail.

Fix. Slice depth matters. “French” is too coarse. “Belgian French” is the level where regressions actually hide.

7. Bypassing the gate without a written override rationale

What happened. A high-pressure release window. The gate failed on one slice — non-critical, in the team’s judgment. Someone reached for the force-override flag. In an earlier version of the pipeline, force-override was a single boolean. The flag flipped, the release shipped, and three weeks later nobody could reconstruct who decided what about which slice.

Why the pipeline catches it now. Force-override is a two-field gate: forceGateOverride: true AND overrideReason: "...". The reason is a required free-text string written into the audit log alongside the user ID and the per-slice gate result that was overridden. The pipeline refuses the override without the reason. You can still override — you just can’t override anonymously.

Fix. Governance gates aren’t a separate stage. They’re a property of the gate stage: every override is a signed receipt with rationale text.

Stage ③ — Roll

8. Going from 0% to 100% of traffic in one step

What happened. A model passed the gate cleanly. It got pushed to 100% of traffic immediately. On a quirk of conversation length, the new model timed out on responses longer than ~2,400 tokens — a behavior that didn’t surface on the gate’s 100-question evaluation set because every test prompt was short. 15% of users got a timeout for 18 minutes before someone manually rolled back.

Why the pipeline catches it now. The Roll stage holds at 5% for dwell_5pct_seconds (default 240) OR requests_5pct (default 1,000), whichever is later. At 5% traffic, the long-conversation timeouts surface in the 5xx-rate monitor within ~3 minutes. The pipeline refuses to advance past 5% if any checkpoint monitor breaches its band. Mean time to halt was 4 minutes; mean time to a full rollback was about 12 seconds after halt.

Fix. Canary in three steps with a quality monitor, not just latency and 5xx. The “five-percent in twenty seconds and we’re done” pattern is the dangerous one. The five-percent-for-four-minutes pattern is the safe one.

Stage ④ — Observe

The two failures below are the ones an infrastructure-metric canary would have promoted. The reason infrastructure metrics miss them is also structural — latency and 5xx can stay perfectly clean while the model quietly hedges, refuses, or hallucinates.

9. Silent hedging on legal queries (silent quality drop #1)

What happened. A safety-tuned model update made the legal-domain assistant noticeably more conservative. Same latency, same 5xx rate, same token usage. But where the previous version had answered “the statute of limitations is X years,” the new version said “you should consult an attorney.” Customers noticed in hours. The dashboards never moved.

Why the pipeline catches it now. The Stage 4 observer runs continuous replay of production traces through the active model and scores them with the same calibrated judge that powered Gate-2. Hedging shows up immediately because the calibrated judge — anchored to human ratings of what a “good” legal answer looks like — penalizes refusal-when-an-answer-was-expected. The output-quality monitor dropped below its band for three consecutive minutes and the pipeline auto-rolled back. Total elapsed: under five minutes.

Fix. Don’t monitor only latency and 5xx. Monitor a quality score derived from a calibrated judge against actual production traces. SageMaker’s deployment guardrails^[2] auto-rollback on CloudWatch alarms — useful for infrastructure, but the alarm has to fire on a metric, and “model is hedging” is not a metric CloudWatch sees.

10. Hallucinated dates after a fine-tune (silent quality drop #2)

What happened. A scheduling-assistant fine-tune started confidently inserting dates that didn’t exist in the input. “Your meeting is on Thursday March 32nd.” Latency unchanged. 5xx rate unchanged. The hallucinations passed the safety filter because nothing flagged “March 32nd” as harmful — just impossible.

Why the pipeline catches it now. The observer’s calibrated judge — running on real production scheduling traces, not synthetic ones — gives confident-but-wrong answers a worse score than appropriate “I don’t know” refusals. The hallucination-class drop triggered the per-minute observer threshold within two minutes. Auto-rollback fired.

Fix. A judge that’s calibrated against domain expertise. Generic LLM-as-judge will miss “Thursday March 32nd” the same way humans skimming will miss it. Domain-calibrated judges — anchored against domain expert ratings — won’t.

The 10 failures mapped to the pipeline

The bars colored in red are the failures we found while shipping this pipeline — they’re the reason we ended up specifically building the slice-aware gate and the trace-replay observer, instead of shipping a generic canary with infra metrics like everyone else does.

What makes LLM CI/CD different from software CI/CD?

The short version: an LLM release is not a deterministic artifact. The same prompt produces different outputs across runs. The same evaluation set produces different scores across hardware. The same model can pass an aggregate quality check while silently failing on a slice you didn’t include in the eval. Most of the assumptions that traditional CI/CD was built on do not survive contact with a probabilistic system.

Three concrete consequences:

You can’t write expect(output).toEqual(X) assertions. You need a distribution-aware evaluation that consumes rank correlation against a human-anchored grader, not equality against a fixture.
A “passed CI” model can ship broken behavior. The CI passes mean the code runs. They don’t mean the model is right. The release pipeline needs to enforce a quality gate on top of the correctness gate that CI provides.
Rollback isn’t optional and isn’t slow. Because failure modes are probabilistic — and because some of them are silent at the infrastructure layer — the rollback path has to be primary infrastructure, not a backup plan. The release manifest exists specifically to make rollback atomic.

The first post in this series describes the four-stage architecture that responds to these consequences. This post describes the failures it catches.

How do you build a failure-resistant CI/CD pipeline for custom LMs?

The honest answer: you accept that failures will happen and you minimize the time between failure occurring and production traffic returning to a known-good version. The four-stage pipeline above is a specific implementation of that principle, but the principle itself is what matters.

If you’re not using Divinci and you want to build something equivalent, the load-bearing pieces are:

An immutable release manifest that bundles model + prompt + routing + dataset + preprocessing into one SHA. This is what makes 1, 2, and 3 catchable. (Stage 1)
A per-slice gate with thresholds defined by domain owners, not platform owners. This is what makes 4, 5, 6 catchable. (Stage 2)
A canary with quality monitoring at each checkpoint, not just latency and 5xx. This is what makes 8 catchable and what makes 9 and 10 survivable once they hit production. (Stage 3)
A continuous observer that scores actual production traces through the active model with the same calibrated judge that powered the gate. This is what makes 9 and 10 catchable. (Stage 4)
A signed audit receipt for every decision. Hash-chained, externally anchorable. For open-weights model backings, the receipt embeds a vIndex weight-attestation proving the active weights are what the manifest registered. For closed-API backings, the receipt covers the decision chain but cannot claim weight provenance — and the audit trail says so explicitly.

The pieces aren’t novel individually. Every MLOps platform has one or two of them. The combination — slice-aware gate + production-trace observer + atomic rollback + provable receipt — is the part nobody else ships in 2026.

Where to go next

The companion post — How to Build an LLM CI/CD Pipeline With Divinci AI — covers the architecture and the API.
The compliance page documents the vIndex receipt format that backs every release decision and how it maps onto EU AI Act, GDPR Article 17, HIPAA, and NIST AI RMF.
The AutoRAG product page covers the RAG-side hallucination reduction that pairs naturally with the calibrated judge that drives Gate-2 and the Stage-4 observer.
The API reference — every command referenced in this series is a real endpoint.

FAQ

What is the most common CI/CD failure for custom language models?

Across the releases we’ve shipped, the single most damaging failure is a slice-aware regression that passes an aggregate gate — a model that improves on average while quietly collapsing on a specific subdomain (failures 4, 5, and 6 above). It is more common than missing rollback, more common than prompt drift, and harder to detect than either. The fix is structural, not parameter-tuning: gate per slice, not on the average.

How fast should you be able to roll back a bad LLM release?

Order-of-magnitude seconds, not minutes. The mean rollback time on the Divinci pipeline is about 12 seconds — that’s in-flight request drain on a ~100-replica service, not the manifest swap itself, which is sub-second. The architectural decision that makes this possible is the bundled release manifest: because every component (weights, prompt, routing, dataset) is referenced from one SHA, the rollback is a single atomic re-point. Compare this to public postmortems: Cloudflare’s June 2022 incident^[3] took 44 minutes to revert because engineers were stepping on each other’s reverts; Atlassian’s April 2022 outage^[4] took 12 hours per affected site to restore because state was spread across multiple systems.

Why do prompt changes cause so many production outages?

Because prompts are routinely edited outside the CI/CD pipeline — in dashboards, in admin UIs, sometimes by people without engineering review. They are treated as configuration, but they behave like code. A 38-character edit to a system prompt can change downstream model behavior more than a model retraining. The fix is to register prompts as part of the release manifest and require them to pass the same gate the model passes.

How do you detect silent quality degradation in LLM outputs?

Not with infrastructure metrics. Latency, 5xx rate, and token usage will not catch hedging, refusal-when-an-answer-was-expected, or hallucinated dates. The detection signal has to come from a quality score computed by a calibrated judge against actual production traces. The Stage 4 observer in Divinci’s pipeline replays a rolling sample of production traces through the active model, scores them with the same human-anchored Spearman judge that powered Gate-2, and triggers automatic rollback when the quality score drops below threshold for three consecutive minutes.

What audit trail requirements apply to AI model deployments?

The EU AI Act, GDPR Article 17 (right to erasure), HIPAA, and the NIST AI Risk Management Framework all require organizations to maintain records of model versions, evaluation results, approval decisions, and rollouts. The unstated requirement underneath all four is that the records have to be verifiable — auditable means more than “we have a log.” Divinci’s vIndex receipts are hash-chained and externally anchorable, which means an auditor can verify the chain without trusting our logs. For open-weights model backings the receipt also embeds a weight-attestation; for closed-API backings the receipt explicitly notes that weight provenance is not claimed.

References

Tianpan — The Semver Lie: how an LLM minor update breaks production (April 2026). Names the dashboard-prompt-edit failure mode directly. Companion: LLM postmortem template — fields SRE missed.
AWS SageMaker — Use canary traffic shifting. The standard infrastructure-metric-driven auto-rollback. Useful comparison for what Stage 4 Observe is doing differently (quality score, not CloudWatch alarms).
Cloudflare — Cloudflare outage on June 21, 2022. 44-minute revert because engineers walked over each other's reverts. Cited as the "rollback is its own kind of incident" anchor.
Atlassian — Post-Incident Review: April 2022 Outage. 12 hours per site to restore. State-spread-across-systems failure mode in its worst form.
DORA — Software delivery performance metrics. The "failed deployment recovery time" elite-performer threshold is documented as under one hour. Useful framing for "how fast is fast enough" on rollback.
Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (arXiv:2306.05685, 2023). The reference for why LLM-as-judge can match human ratings overall but vary widely per category — which is exactly the pattern that makes per-slice gating necessary.

Next in this series: Validating and releasing custom LMs in regulated fields. The pipeline above is the architecture. The compliance pathway is the practice of using it. EU AI Act, GDPR Article 17, HIPAA, and NIST AI RMF — what each one asks of a release process, and which vIndex receipt fields cover which requirement.

Ready to Build Your Custom AI Solution?

Discover how Divinci AI can help you implement RAG systems, automate quality assurance, and streamline your AI development process.

Get Started Today