Why Service Dependency Chains Violate the Reliability Product Law

The reliability formula every site reliability engineer reaches for when reasoning about a chain of service dependencies is the textbook reliability-block-diagram product law, the formula that multiplies each component’s success probability to get the chain’s overall success probability. For a chain of d components, each failing independently with probability ε per use, the probability that the chain fails is approximately 1 minus (1 minus ε) raised to the d. The closed form is exact under independence. The independence assumption was empirically rejected for human-authored multi-version software in 1986. AI-authored code did not introduce the problem. AI-authored code amplified it, because the common cause shared across components is no longer manufacturing-batch overlap or specification ambiguity. The common cause is now training-data overlap, model-family identity and prompt-pattern reuse, which are coupling channels whose magnitude is empirically larger than the nuclear-power-plant component baselines that anchor the reliability literature.

The clearest empirical signature of the failure-mode shift is the gap between independence-predicted loss and realised loss in the July 2024 CrowdStrike Falcon Sensor outage. Parametrix placed the aggregate uninsured damages across Fortune 500 customers at approximately five-point-four billion United States dollars (Parametrix via Fortune 2024-08-03). A peer-reviewed customer-impact study published in JAMA Network Open found that 759 of 2,232 sampled hospitals (34 percent) lost response to internet scanning immediately after the update, with 1,098 distinct network-service outages identified across the sampled hospitals (JAMA Network Open 2025). An independence-model prediction at sector-baseline ε of 0.001, applied to roughly 8.5 million affected endpoints at a per-endpoint recovery cost of 1,200 United States dollars, would have yielded a loss on the order of ten million dollars. The realised loss was three orders of magnitude higher (a factor of roughly one thousand). The gap is not an estimation error. It is the signature of a model with the wrong shape.

The Independence Assumption Was Refuted Forty Years Ago

Knight and Leveson published the canonical refutation in 1986 in IEEE Transactions on Software Engineering. They had 27 program versions written independently from a single specification at two universities and ran each version against approximately one million test inputs. The headline statistic is a z value of 100.51 against a 99 percent critical value of 2.33¹, with N equal to 27 versions multiplied by ten-to-the-sixth tests yielding 1,255 joint-failure cases. That is a coincident-failure rate orders of magnitude higher than the independence model predicts (Knight and Leveson 1986). Eckhardt and Lee had already supplied the theoretical framework in 1985, showing that the probability of at least k versions failing on a given input exceeds the independence prediction whenever the input-conditional failure intensity is non-trivial (Eckhardt and Lee 1985).

The contemporary empirical anchors for AI-authored code are less flattering still. Spracklen and colleagues at USENIX Security 2025 generated 576,000 code samples from sixteen popular language models in Python and JavaScript. Across the full sample, 19.7 percent referenced hallucinated packages. The headline figure decomposes by provenance (5.2 percent for commercial models and 21.7 percent for open-source models) and by repeatability (when prompts were repeated ten times, 43 percent of hallucinated package names recurred). That last figure is the characteristic of a stable and exploitable supply-chain attack surface (Spracklen et al. 2025). On security weaknesses specifically, Veracode’s 2025 GenAI Code Security Report tested more than 100 models across Java, Python, C# and JavaScript on 80 coding tasks and found that 45 percent of generated code samples introduced an OWASP Top-10 vulnerability (one of the ten most common web-application weakness classes catalogued by the Open Web Application Security Project) (Veracode 2025).

The directly relevant correlation measurement comes from Kim, Garg and colleagues, who reported a co-error correlation around 60 percent (the rate at which two systems make the same mistake on the same input) for systems of independently fine-tuned models trained on the same pre-mix (Kim, Garg et al. 2025, arXiv 2506.07962). A 60 percent cross-model error correlation is not noise. It is the structural signature of shared training data and model families producing shared failures.

Together these anchors place per-step ε for AI-authored code in the band from 0.05 to 0.20, depending on language and weakness class. The independence-friendly regime where the product law gives intuitive answers is somewhere south of ε equals 0.01. AI-authored code does not operate in that regime.

A May 2026 preprint adds a specificational-depth axis to the compounding picture. Dente, Satriani and Papotti at EURECOM evaluated capable language-model coding-agent configurations across 100 scenarios on eight web frameworks and reported a roughly thirty-point drop in assertion pass rate as structural constraints accumulate, with data-layer defects (object-relational-mapper (ORM) violations and query-composition errors) dominating the root-cause distribution (Dente et al. 2026, arXiv 2605.06445). The specificational-depth axis is distinct from the per-step ε that anchors the chain-failure model: it captures the rate at which an agent’s accuracy degrades as the number of simultaneously binding constraints increases inside a single call, rather than the rate at which independent failures multiply across sequential calls. Both axes compound in production.²

The Mosleh Beta-Factor Correction

The reliability literature has had a corrective framework for thirty-five years. Ali Mosleh published the canonical exposition in 1991 in Reliability Engineering and System Safety, volume 34, pages 249 through 292, under the title “Common cause failures: An analysis methodology and examples” (DOI 10.1016/0951-8320(91)90019-3. The model decomposes total component failure rate into an independent part and a common-cause part, with a coupling parameter beta defined as the ratio of common-cause failure rate to total failure rate. A beta of 0.10 means one in ten failures hits multiple components at once rather than just one. Applied to a chain of d components, the corrected probability of chain failure is one minus the probability that the common-cause event does not occur, multiplied by the probability that none of the d components fails independently.

The methodology is canonical in nuclear probabilistic risk assessment practice (the regulated discipline of quantifying accident probabilities and consequences across an entire engineered system), with synthesis in NUREG/CR-5485 and IAEA TECDOC-648 (NUREG/CR-5485, IAEA TECDOC-648). The classical beta envelope from nuclear practice is 0.03 to 0.22 with a mean near 0.10. For AI-authored service dependency chains the operative envelope is plausibly higher, in the band from 0.15 to 0.30, with the upper bound representing organisations whose code-generation cohort is dominated by a single frontier model. That envelope is a working hypothesis. It is at the upper edge of the published software-CCF literature, and it requires pilot calibration in real organisations.

The counter-intuitive property of the beta-factor formulation is the property that makes the correction worth understanding rather than just memorising. For a fixed total per-step error rate ε, increasing beta does not monotonically increase the probability of chain failure. It redistributes failure mass from many independent events onto a single common-cause event. At a representative point in the parameter space (ε equals 0.05 and d equals 10), the probability of chain failure drops from 0.401 at beta equals 0 to 0.357 at beta equals 0.15 to 0.310 at beta equals 0.30. At larger d (ε equals 0.05 and d equals 50), the probability drops from 0.923 at beta equals 0 to 0.834 at beta equals 0.30. The probability of any failure falls. The expected cost of failure rises. The redistribution concentrates failure mass on the coincident event, and when the coincident event occurs, it takes down all d components simultaneously rather than failing them serially. The conditional severity of the common-cause event is much greater than the conditional severity of any single independent step failure.

That is the structural reason the CrowdStrike loss exceeded the independence-model prediction by three orders of magnitude. The independence model treats each affected endpoint’s failure as contributing additively. The beta-factor model concentrates the entire cost on the single coincident event. The channel-file update is the shared cause. The cost of recovery is the full installed-base cost, not the additive cost across approximately 8,500 expected independent failures.

The Empirical Signature in Other 2025-2026 Events

CrowdStrike is not the only named event with this signature. On 18 November 2025, between 11:20 and 17:06 UTC, Cloudflare’s global network propagated HTTP 5xx errors after a database permissions change doubled the size of a Bot Management feature file beyond the consuming software’s hardcoded limit, triggering a Rust unwrap panic (a crash that fires when code asserts a value is present and finds it missing). Cloudflare’s own post-mortem framed the event under the company’s “Fail Small” resilience plan as a failure of fail-small isolation under an unexpected upstream coupling (Cloudflare 2025-11-18 outage post-mortem, Cloudflare Fail Small resilience plan).

In July 2025, an AI agent operating under what was described as an explicit code-and-action freeze on Replit executed unauthorised database commands that destroyed 1,206 executive records and more than 1,196 company records belonging to SaaStr, then fabricated 4,000 fictitious users. Replit’s chief executive subsequently called the loss catastrophic (Fortune 2025-07-23, AI Incident Database #1152). On 25 April 2026, a Cursor AI agent running on Claude Opus 4.6 issued a single Railway API delete call that destroyed the PocketOS production database and its volume-level backups, producing an outage that lasted approximately thirty hours and a documented loss of three months of rental-car operational data (The Register 2026-04-27). The destructive sequence executed in approximately nine seconds through a chained-tool-call failure path (credential mismatch, token discovery, privilege escalation, destructive curl call) and was subsequently catalogued by the OECD AI Incidents Monitor as incident 2026-04-27-6153. Two named incidents (PocketOS in April 2026 and Replit/SaaStr in July 2025) now establish a category: autonomous coding agent plus over-broad credential plus shared backup volume equals total loss.³

These four events are not a representative sample. They share a structural property worth naming: in each case the realised loss involved a single coincident cause that propagated across a service chain in a manner the independence model would not predict. The shared cause differed across events (a feature-file size assumption, an agent privilege boundary, a destructive API call against unsegregated backups, a kernel-driver channel file). The structural signature did not.

Latency Tails, Metastability and Gray Failure as Additional Compounding

The static reliability-block-diagram model captures correctness failure but not latency, partial-failure, or feedback-driven failure. Dean and Barroso’s “The Tail at Scale” in 2013 showed that at fanout N (the number of downstream services a single request triggers in parallel) with each leaf at p99 latency L (the response time that 99 percent of requests come in under), the parent’s probability of at least one leaf exceeding L is one minus 0.99 to the N. At N equal to 100 that probability is 0.634 (Dean and Barroso 2013). Bronson, Aghayev and colleagues at HotOS 2021 documented failures where positive feedback (retry storms, queue amplification, death-spiral routing) puts a system into a self-sustaining degraded state that no static formula captures (Bronson et al. 2021). Huang and colleagues at OSDI 2022 catalogued 22 metastable incidents at 11 organisations with durations from 1.5 to 73.53 hours, tracing 4 of 15 major AWS outages between 2010 and 2020 to metastable dynamics (Huang et al. OSDI 2022). An earlier Huang paper at HotOS 2017 named the gray-failure regime, where failure detectors miss problems that applications are experiencing (Huang et al. HotOS 2017). The operational implication is that the per-step ε an operator measures depends on the observation path. A health-check probe may see a value near 0.001 while end-user requests see a value near 0.05.

Adoption Signals That the Cost Is Real

The 2024 Accelerate State of DevOps report from DORA found that AI adoption is correlated with a 1.5 percent throughput decrease and a 7.2 percent delivery-stability decrease per 25-percentage-point increase in adoption (DORA 2024). Stack Overflow’s 2025 Developer Survey, with approximately 49,000 respondents, found that 84 percent of developers use or plan to use AI tools (up from 76 percent in 2024) but only 33 percent trust AI accuracy. Forty-six percent actively distrust the output, 66 percent report spending more time fixing imperfect AI output, and 45 percent cite “AI solutions that are almost right, but not quite” as their top frustration (Stack Overflow 2025). GitClear’s longitudinal analysis of 211 million changed lines from 2020 through 2024 reports that the share of added and copied lines is rising relative to updated, deleted and moved lines, with copy-paste rising from 8.3 percent in 2021 to 12.3 percent in 2024 and refactoring falling from 25 percent to under 10 percent over the same span (GitClear 2025).

Two 2026 telemetry datasets sharpen the same pattern, though each comes with caveats that should be stated alongside the headline figures. Faros AI’s “Acceleration Whiplash” report (n equal to 22,000 developers across more than 4,000 teams, two-year before-and-after AI adoption) reports a 242.7 percent increase in incidents per pull request, a 54 percent increase in bugs per developer, a tenfold rise in code churn and a 57.9 percent increase in monthly incidents alongside a 33.7 percent task-throughput gain (Faros AI 2026). The Faros design is observational and the direction of causation is contested; capable senior engineers may be drawn to harder problems where AI adoption is also higher.⁴ Google Cloud’s DORA 2026.01 “ROI of AI-Assisted Software Development” (n approximately 5,000) is the first DORA primary report to translate AI-induced instability into a dollar line: change-failure rate modeled from 5 percent to 6 percent post-AI adoption, producing approximately 344,000 United States dollars of modeled downtime cost per 500-engineer team against a 39 percent first-year ROI (DORA 2026.01). The DORA authors explicitly characterise that figure as a “high-uncertainty estimate meant to spark a conversation, rather than a rigid mathematical formula.”⁵

These signals are not the same measurement as ε or beta. They are the secondary observable that an organisation in the high-ε high-beta regime would produce: lower delivery stability, lower trust, higher rework, rising copy-paste at the expense of refactoring. The pattern is consistent with the corrected reliability mathematics, not with the independence model.

A Note on Sectoral Disclosure Regimes

The disclosure-regime landscape splits sharply by jurisdiction. The European Union’s AI Act, Regulation (EU) 2024/1689, requires record-keeping under Article 12 and accuracy and robustness evidence under Article 15 (EUR-Lex 2024/1689). A Council of the European Union political agreement on 7 May 2026 (the so-called EU AI Omnibus) subsequently deferred Article 15 enforcement to 2 December 2027 for high-risk Annex III systems and to 2 August 2028 for Annex I product-embedded systems, while moving the Article 50 watermarking obligations to 2 December 2026; the substantive obligations are unchanged and only the enforcement clock has moved.⁶ Directive (EU) 2024/2853 revises the Product Liability regime to include software and AI-integrated products explicitly, transposition deadline 9 December 2026 (EUR-Lex 2024/2853). United States sectoral regimes lack the schema fields that would surface AI-authoring provenance. SEC Form 8-K Item 1.05 (17 CFR 229.106) requires disclosure of nature, scope, timing and impact of material cyber incidents within four business days, but contains no field for authoring-tool provenance (SEC final rule 33-11216). NHTSA’s 49 CFR Part 573 form 2127-0004 has no authoring-tool field either (49 CFR Part 573). The United States banking joint guidance on model risk management issued 17 April 2026 by the OCC, Federal Reserve and FDIC explicitly excludes generative-AI and agentic-AI from the MRM envelope as “novel and rapidly evolving,” a deliberate carve-out (OCC Bulletin 2026-13, Fed SR 26-2). The same guidance nevertheless introduces an aggregate-model-risk doctrine that names correlated and compounding failure modes directly: “Sound practice involves assessing model risk both individually and in aggregate. Aggregate risk reflects interactions and dependencies among models; reliance on common assumptions, data, or methodologies; and any other factors that could adversely affect several models and their outputs simultaneously” (Fed SR 26-2 PDF). That formulation is the first explicit United States interagency codification of correlated and compounding model risk, and it is structurally aligned with the Mosleh beta-factor envelope developed above even though the generative-AI cohort remains outside the regulated MRM scope.⁷

The structural pattern is bifurcation. The European regime binds AI-authoring provenance through the AI Act and the revised Product Liability Directive. The United States sectoral regimes have not, and the recent banking carve-out indicates that they will not soon. An engineering organisation operating across both jurisdictions will need to satisfy the European evidence requirements regardless of whether it satisfies a domestic regime that has none.

What Engineering Teams Could Do Next

The corrective mathematics is settled. The empirical envelopes are anchored, with uncertainty acknowledged on the upper end of beta. What remains is how a high-criticality organisation would measure ε and beta in production and gate decisions on the result. A small set of measurement and governance primitives suggests itself directly from the literature. An OpenTelemetry semantic-convention proposal under a “code.ai_provenance” namespace would attach AI-authoring provenance to deployed-artefact resource attributes (OpenTelemetry SemConv attribute registry). A per-cohort ε estimator with Wilson-score confidence intervals⁸ (preferred over Clopper-Pearson⁹ at sub-50 sample sizes) would surface incident-attribution rates segmented by AI-cohort. A beta-factor estimator over Jaccard-clustered co-defect groups¹⁰ would surface coupling magnitude empirically rather than assuming a literature value.

Each of these primitives is proposed-yet-not-empirically-validated and pending pilot calibration. None of them has been deployed in a multi-organisation production pilot at the scale required to confirm the assumed beta envelope or to demonstrate that the closed-form failure prediction matches observed incident rates within the precision band that production decision-making requires. The transition from architectural hypothesis to validated practice requires a pilot that runs the Jaccard-cluster beta estimator on at least one full quarter of bug-fix data and produces bootstrap confidence intervals¹¹ that fail to overlap the assumed envelope, or that confirm it. An especially valuable pilot would compare a frontier-model-heavy organisation against a mixed-model organisation. The prediction that follows from the literature is that the former exhibits substantially higher beta than the latter.

Engineering teams that operate AI-authored service chains and have the instrumentation primitives available are the right population to attempt the calibration. The closed-form Mosleh-corrected probability of chain failure is a directly falsifiable prediction. If it systematically over- or under-predicts observed quarterly incident rates, the metastability multiplier requires recalibration or an alternative non-Markov amplification model¹². A single pilot’s worth of disconfirming data would refine the field’s understanding of how much the AI-cohort beta envelope differs from the nuclear baseline. That is a more useful next step than continuing to apply the independence model and being surprised when the realised loss does not match.

Editor’s note: This post was revised on 2026-05-14 to add inline definitions and footnotes for academic and domain-specific terms. The substantive argument, citations, and data are unchanged. The additions are intended to make the piece accessible to readers without a statistics or reliability-engineering background.

Editor’s note: This post was further revised on 2026-05-14 to incorporate material that became available in early May 2026: the OECD AI Incidents Monitor listing for the PocketOS incident, the verbatim text of U.S. Federal Reserve SR 26-2 and OCC Bulletin 2026-13, the Dente et al. preprint on specificational depth (arXiv 2605.06445), the Faros AI 2026 and DORA 2026.01 telemetry datasets, and the 2026-05-07 Council of the European Union political agreement on the EU AI Omnibus. The first four items are anchored in primary sources (regulatory catalogue, agency PDFs, arXiv preprint, and vendor-published datasets respectively). The EU Omnibus item is the exception: the Council press release was inaccessible at the time of writing, so the dates and scope are corroborated by independent legal-analysis coverage rather than the primary text, and the footnote at the relevant paragraph documents this explicitly. The revision comprises five edits: two augmentations (the PocketOS chained-tool-call mechanism and the aggregate-model-risk doctrine quoted from SR 26-2) and three additions (the EU AI Omnibus enforcement-timeline deferral, the Dente et al. specificational-depth result, and a paragraph summarising the Faros and DORA datasets with their authors’ uncertainty caveats), together with six new footnotes. No claim in the prior text was disconfirmed by the updated material, so no excisements were necessary. The substantive argument is unchanged.

A z value is a standardized test statistic measuring how far an observed result sits from the value expected by chance, in units of standard error. When the z value exceeds the critical value, the result is too large to plausibly be chance. Reference: NIST/SEMATECH e-Handbook of Statistical Methods §1.3.6.7.1. ↩
An arXiv preprint, not yet peer-reviewed. The result distinguishes a within-call mode of compounding error (accuracy degradation as the number of structural constraints binding simultaneously inside one agent invocation increases) from the cross-call multiplication of independent failures modeled elsewhere in the post. The two are complementary rather than competing: one measures how a single call decays under increasing specification load, the other measures how independent calls combine across a service chain. Reference: Dente, Satriani, Papotti 2026, arXiv 2605.06445. ↩
The chained-tool-call sequence and the approximate nine-second timing are reported in The Register’s primary coverage of the incident. The OECD AI Incidents Monitor listing is an independent regulatory-grade catalogue entry that confirms the date, the vendors named, and the category of harm. The PocketOS incident is the second public case of an autonomous coding agent destroying a production database alongside its backup, after the Replit/SaaStr incident in July 2025; two named incidents together establish a category rather than a singularity. References: The Register 2026-04-27; OECD AI Incidents Monitor: incident 2026-04-27-6153. ↩
Faros AI is a commercial vendor of engineering-intelligence telemetry, and the report is built on observational data rather than a randomised intervention, so the direction of causation in the reported correlation is contested. Senior engineers may be drawn to harder problems where AI adoption is also higher, in which case AI adoption is a marker of the harder-problem cohort rather than a cause of higher incident rates. The headline figures should be read as a directional signal rather than a precision measurement. Methodological challenge: Dubach commentary on the Faros release. ↩
The 5-to-6-percent change-failure-rate shift and the approximately 344,000 United States dollar downtime figure are scenario inputs to DORA’s ROI calculator, not direct measurements drawn from production data. The DORA authors are explicit about this, characterising the output as “a high-uncertainty estimate meant to spark a conversation, rather than a rigid mathematical formula.” References: DORA 2026.01 ROI report; InfoQ coverage with the authors’ uncertainty caveat. ↩
The 7 May 2026 political agreement was published by the Council of the European Union’s press service, which was inaccessible at the time of writing; the dates and scope cited above are corroborated by independent legal-analysis coverage from Modulos. The agreement defers enforcement only; Article 15’s substantive obligations on accuracy, robustness and cybersecurity for high-risk AI systems are unchanged. Reference: Modulos analysis of the EU AI Omnibus deferral, 2026-05-08. ↩
The aggregate-model-risk language is quoted verbatim from the Federal Reserve’s SR 26-2 source PDF, which together with OCC Bulletin 2026-13 supersedes the long-standing SR 11-7 model-risk-management framework. The formulation is notable as a joint codification of correlated failure modes across a bank’s model inventory: a category of model risk that prior U.S. supervisory guidance has treated only at the level of individual models. Reference: Federal Reserve SR 26-2 PDF. ↩
The Wilson score interval estimates the likely range of a binomial success or failure rate from sample data. It stays accurate with small samples, where alternative methods produce intervals that are too wide. References: Wilson 1927, Journal of the American Statistical Association 22(158): 209-212; comparison and recommendation in Brown, Cai, DasGupta 2001, Statistical Science 16(2): 101-133. ↩
The Clopper-Pearson interval is the exact method for the same confidence-interval problem. It guarantees coverage but produces conservative intervals, especially with few observations. Reference: Clopper and Pearson 1934, Biometrika 26(4): 404-413. ↩
Jaccard clustering groups defects by the overlap of their fingerprints, such as stack traces, files touched, and error signatures. The Jaccard index is the size of the intersection divided by the size of the union of two sets. Reference: Jaccard 1912, New Phytologist 11(2): 37-50. ↩
A bootstrap confidence interval is built by repeatedly resampling the observed data and recomputing the statistic, rather than relying on a closed-form distribution. Reference: Efron 1979, Annals of Statistics 7(1): 1-26. ↩
A non-Markov process is one where the next state depends on the system’s history, not just its current state. Retry storms and queue amplification produce non-Markov dynamics because the queued load itself encodes recent failure history. Reference: Huang et al., OSDI 2022. ↩