
The cost of cascade failure has a number. Here is how to calculate yours.

The ROI case for pre-threshold detection depends on three variables that differ by deployment. This page lays out the model with the assumptions explicit, so you can stress-test it against your own incident data.

The Constraint Architecture Review quantifies your exposure.

A fixed-fee engagement that produces a written assessment of your pipeline configurations, threshold positioning, and cascade exposure — suitable for presentation to your CTO and compliance functions.

Where reactive recovery is good enough

Stateless, idempotent pipelines with short reasoning chains and no SLA exposure are well served by existing reactive orchestration. A failed API call retries in milliseconds, and TrueFoundry handles it. The case for pre-detection in that context is weak, and we will say so in a Calibration Engagement rather than oversell it.

Where the gap becomes expensive

Three structural properties of production agentic pipelines break the idempotent retry assumption:

State is expensive.

An agent 14 steps into a reasoning chain has retrieved context, called tools, and built intermediate state. A hard drop loses all of it. The cost is not the infrastructure recovery time — it is the full task restart from step one.

Cascades are the real failure mode.

One degrading node in a multi-agent pipeline doesn't just fail itself. It returns slow or malformed outputs that downstream nodes treat as valid inputs. By the time the infrastructure layer fires its reroute trigger, you have already contaminated several steps. Reactive systems catch the node. They don't catch what that node already did to the pipeline.

Threshold-setting is guesswork.

TrueFoundry fires at whatever latency threshold an engineer set. A node can be structurally transitioning toward failure at 2,000ms while the trigger sits at 5,000ms. You are recovering from failures you could have anticipated, with no mathematical basis for where your threshold should be.

The three cost buckets

Computational waste

Direct and small. A restarted 15-step agentic task doubles your token and compute spend for that task. At £0.50–£2.00 per task depending on model mix, this bucket alone rarely justifies the investment.

Engineering overhead

Meaningful at scale. Degradation failures — the kind COBT catches — are harder to diagnose than hard drops. They produce intermittent symptoms, outputs that look valid but aren't, and post-mortems that take engineering time and interrupt other work.

Business impact

This is where the case lives or dies. Cascade contamination, SLA penalties, and customer-facing quality degradation dwarf the other two buckets. Without contractual SLA penalties, the ROI is still strong — with them, it is the dominant term.

Worked example

Mid-scale enterprise deployment

Deployment assumptions:

| Variable | Value | Basis |
| --- | --- | --- |
| Monthly agentic tasks | 30,000 | Mid-scale enterprise |
| Failure rate | 3% | Conservative industry estimate |
| Degradation vs hard drop | 60/40 | Only degradation-type failures catchable by COBT |
| COBT detection efficacy | 65% | Conservative; hard drops excluded |
| Cascade rate | 30% | Multi-agent pipelines only |
| SLA breach rate | 10% | Enterprise B2B with contractual SLAs |

Monthly catchable failures: 30,000 × 3% × 60% = 540

Weighted cost per failure (reactive):

| Component | Cost | Applies to |
| --- | --- | --- |
| Task restart (tokens + compute) | £0.80 | 100% of failures |
| Cascade cleanup | £600 | 30% of failures |
| Engineering triage | £200 | 25% of failures |
| SLA penalty | £3,000 | 10% of failures |

Weighted average cost per failure: £0.80 + (30% × £600) + (25% × £200) + (10% × £3,000) = £530.80

Annual cost (reactive): 540 × 12 × £530.80 = £3,439,584

With COBT pre-detection: 351 failures/month (540 × 65%) converted to planned graceful drains at ~£30/event.

Annual saving: 351 × 12 × (£530.80 − £30) = £2,109,370

Year 1 investment:

| Item | Fee |
| --- | --- |
| Constraint Architecture Review | £55,000 |
| Reliability Framework Design | £40,000 |
| Year 1 tooling and monitoring | £18,000 |
| Total Year 1 | £113,000 |

Year 1 net benefit: £1,996,370

ROI: 1,767%

Payback period: ~20 days
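The worked example is simple enough to script and stress-test. The sketch below reproduces the arithmetic from the assumption and cost tables above; every constant is one of the page's stated assumptions, not a measured benchmark — substitute your own figures before drawing conclusions.

```python
# ROI model from the worked example. All constants are the page's
# stated assumptions, not measured data.

MONTHLY_TASKS = 30_000
FAILURE_RATE = 0.03          # conservative industry estimate
DEGRADATION_SHARE = 0.60     # only degradation-type failures are catchable
DETECTION_EFFICACY = 0.65    # conservative; hard drops excluded
DRAIN_COST = 30.0            # £ per planned graceful drain

# Cost component -> (cost in £, share of failures it applies to)
COMPONENTS = {
    "task restart":       (0.80,     1.00),
    "cascade cleanup":    (600.00,   0.30),
    "engineering triage": (200.00,   0.25),
    "SLA penalty":        (3_000.00, 0.10),
}

catchable = MONTHLY_TASKS * FAILURE_RATE * DEGRADATION_SHARE   # failures/month
caught = catchable * DETECTION_EFFICACY                        # pre-detected/month
per_failure = sum(cost * share for cost, share in COMPONENTS.values())

annual_reactive_cost = catchable * 12 * per_failure
annual_saving = caught * 12 * (per_failure - DRAIN_COST)

year1_investment = 55_000 + 40_000 + 18_000   # review + framework + tooling
net_benefit = annual_saving - year1_investment
roi_pct = 100 * net_benefit / year1_investment
payback_days = 365 * year1_investment / annual_saving

print(f"catchable: {catchable:.0f}/month, pre-detected: {caught:.0f}/month")
print(f"weighted cost per failure: £{per_failure:.2f}")
print(f"annual saving: £{annual_saving:,.0f}")
print(f"ROI: {roi_pct:.0f}%, payback: {payback_days:.0f} days")
```

Changing any one constant and re-running is the fastest way to see which assumption your own case actually hinges on.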

The caveat

The SLA penalty assumption drives approximately 60% of the return. Without contractual SLA penalties:

Weighted cost per failure: £230.80
Annual saving: £845,770
ROI: 648%

Still a strong case — but the engineering overhead assumption now carries the weight, and that number needs validating against your actual incident data. That is what the Constraint Architecture Review produces.
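The no-SLA sensitivity check is the same arithmetic with the £3,000 penalty term removed. A minimal self-contained sketch, using the worked example's remaining assumptions:

```python
# Sensitivity check: the worked-example model with the SLA penalty removed.
# Remaining components: task restart, cascade cleanup, engineering triage.
components = [(0.80, 1.00), (600.0, 0.30), (200.0, 0.25)]

per_failure = sum(cost * share for cost, share in components)
caught_per_month = 30_000 * 0.03 * 0.60 * 0.65        # 540 catchable x 65% efficacy
annual_saving = caught_per_month * 12 * (per_failure - 30)   # £30 graceful drain
roi_pct = 100 * (annual_saving - 113_000) / 113_000          # Year 1 investment

print(f"£{per_failure:.2f} per failure, £{annual_saving:,.0f}/yr, ROI {roi_pct:.0f}%")
```

Note how removing one component collapses the return by roughly 60% — which is exactly why the engineering-overhead figure needs validating before the model is presented.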

Three numbers to get before any conversation

Before presenting any version of this model to your leadership, you need three figures from your own operations:

1. What does a pipeline incident cost your engineering team to resolve? Not the tooling cost — the people cost, including the post-mortem and interrupted sprint work.

2. Do you have contractual SLA obligations on AI-powered outputs, and what are the penalties per breach?

3. What percentage of your current failures produce downstream contamination before your orchestration layer fires?

If you cannot answer question 3, that is itself a diagnostic finding. It means you do not have visibility into your cascade exposure — which is precisely what a Calibration Engagement addresses.
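Once gathered, the three figures slot directly into the weighted-cost formula from the worked example. The helper below is illustrative only — the default rates are the example's assumptions, and the function name is ours, not a product API:

```python
# Illustrative helper: plug your three figures into the per-failure cost
# model. Defaults are the worked example's assumptions, not benchmarks.

def cost_per_failure(triage_cost, sla_penalty, cascade_rate,
                     restart_cost=0.80, cascade_cost=600.0,
                     triage_rate=0.25, sla_breach_rate=0.10):
    """Expected £ cost of one reactive failure under your own inputs."""
    return (restart_cost
            + cascade_rate * cascade_cost      # answer to question 3
            + triage_rate * triage_cost        # answer to question 1
            + sla_breach_rate * sla_penalty)   # answer to question 2

# Worked-example inputs reproduce the £530.80 figure:
print(f"£{cost_per_failure(triage_cost=200, sla_penalty=3_000, cascade_rate=0.30):.2f}")
```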
