The cost of cascade failure has a number. Here is how to calculate yours.
The ROI case for pre-threshold detection depends on three variables that differ by deployment. This page lays out the model with the assumptions explicit, so you can stress-test it against your own incident data.
The Constraint Architecture Review quantifies your exposure.
A fixed-fee engagement that produces a written assessment of your pipeline configurations, threshold positioning, and cascade exposure, in a form you can put in front of your CTO and compliance functions.
Where reactive recovery is good enough
Stateless, idempotent pipelines with short reasoning chains and no SLA exposure are well-served by existing reactive orchestration. A failed API call retries in milliseconds. TrueFoundry handles it. The case for pre-detection in that context is weak and we will tell you so in a Calibration Engagement rather than oversell.
Where the gap becomes expensive
Three structural properties of production agentic pipelines break the idempotent retry assumption:
State is expensive to lose.
An agent 14 steps into a reasoning chain has retrieved context, called tools, and built intermediate state. A hard drop loses all of it. The cost is not the infrastructure recovery time — it is the full task restart from step one.
Cascades are the real failure mode.
One degrading node in a multi-agent pipeline doesn't just fail itself. It returns slow or malformed outputs that downstream nodes treat as valid inputs. By the time the infrastructure layer fires its reroute trigger, you have already contaminated several steps. Reactive systems catch the node. They don't catch what that node already did to the pipeline.
Threshold-setting is guesswork.
TrueFoundry fires at whatever latency threshold an engineer set. A node can be structurally transitioning toward failure at 2,000ms while the trigger sits at 5,000ms. You are recovering from failures you could have anticipated, with no mathematical basis for where your threshold should be.
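To make the threshold problem concrete, here is a deliberately simplified sketch. The 5,000ms trigger, the window size, and the slope heuristic are hypothetical stand-ins chosen for illustration, not COBT's actual detection logic; the point is only that a static cutoff and a trend check answer different questions about the same node.

```python
# Illustrative only: a static latency trigger versus a crude trend check.
# All numbers here are hypothetical; they are not COBT's detection method.

def static_trigger(latencies_ms, threshold_ms=5_000):
    """Reactive rule: fire only once a single call breaches the threshold."""
    return latencies_ms[-1] > threshold_ms

def trend_warning(latencies_ms, window=10, growth_per_call_ms=150):
    """Pre-threshold heuristic: fire if latency is climbing steadily,
    even while every individual call still looks 'healthy'."""
    recent = latencies_ms[-window:]
    if len(recent) < window:
        return False
    avg_growth = (recent[-1] - recent[0]) / (window - 1)
    return avg_growth > growth_per_call_ms

# A node drifting from 1,200ms toward failure: the static trigger stays
# silent, while the trend check fires with room left to drain gracefully.
drifting = [1_200 + 200 * i for i in range(12)]   # 1,200ms .. 3,400ms
print(static_trigger(drifting))   # False: still under 5,000ms
print(trend_warning(drifting))    # True: climbing ~200ms per call
```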
The three cost buckets
Computational waste
Direct and small. A restarted 15-step agentic task doubles your token and compute spend for that task. At £0.50–£2.00 per task depending on model mix, this bucket alone rarely justifies the investment.
Engineering overhead
Meaningful at scale. Degradation failures — the kind COBT catches — are harder to diagnose than hard drops. They produce intermittent symptoms, outputs that look valid but aren't, and post-mortems that take engineering time and interrupt other work.
Business impact
This is where the case lives or dies. Cascade contamination, SLA penalties, and customer-facing quality degradation dwarf the other two buckets. Without contractual SLA penalties, the ROI is still strong — with them, it is the dominant term.
Worked example
Mid-scale enterprise deployment
Deployment assumptions:

Monthly agentic tasks: 30,000 (mid-scale enterprise)
Failure rate: 3% (conservative industry estimate)
Degradation vs hard drop split: 60/40 (only degradation-type failures are catchable by COBT)
COBT detection efficacy: 65% (conservative; hard drops excluded)
Cascade rate: 30% (multi-agent pipelines only)
SLA breach rate: 10% (enterprise B2B with contractual SLAs)
Monthly catchable failures: 30,000 × 3% × 60% = 540
Weighted cost per failure (reactive):

Task restart (tokens + compute): £0.80, applied to 100% of failures
Cascade cleanup: £600, applied to 30% of failures
Engineering triage: £200, applied to 25% of failures
SLA penalty: £3,000, applied to 10% of failures
Weighted average cost per failure: £0.80 + (30% × £600) + (25% × £200) + (10% × £3,000) = £530.80, call it £531
Annual cost (reactive): 540 × 12 × £531 = £3,440,880
With COBT pre-detection: 351 failures per month (540 × 65%) are converted to planned graceful drains at ~£30/event.
Annual saving: 351 × 12 × (£531 − £30) = £2,110,212
Year 1 investment:

Constraint Architecture Review: £55,000
Reliability Framework Design: £40,000
Year 1 tooling and monitoring: £18,000
Total Year 1: £113,000
Year 1 net benefit: £2,110,212 − £113,000 = £1,997,212
ROI: 1,767%
Payback period: roughly 20 days
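If you want to stress-test these figures against your own incident data, the whole model reduces to a few lines. The sketch below is illustrative rather than a delivered tool; every default value is the assumption stated in the tables above, the name roi_model is chosen for this page only, and the output lands within rounding of the worked figures.

```python
# Minimal sketch of the ROI arithmetic above. Defaults mirror the worked
# example; substitute your own incident data before presenting this.

def roi_model(
    monthly_tasks=30_000,        # monthly agentic tasks
    failure_rate=0.03,           # 3% of tasks fail
    degradation_share=0.60,      # 60/40 degradation vs hard drop
    detection_efficacy=0.65,     # share of catchable failures converted
    restart_cost=0.80,           # task restart (tokens + compute), all failures
    cascade_cost=600.0,          # cascade cleanup, per affected failure
    cascade_rate=0.30,           # share of failures that cascade
    triage_cost=200.0,           # engineering triage, per affected failure
    triage_rate=0.25,            # share of failures needing triage
    sla_penalty=3_000.0,         # penalty per SLA breach
    sla_breach_rate=0.10,        # share of failures that breach an SLA
    drain_cost=30.0,             # cost of a planned graceful drain
    year1_investment=113_000.0,  # Review + Framework Design + tooling
):
    catchable = monthly_tasks * failure_rate * degradation_share      # 540/month
    weighted_cost = (restart_cost
                     + cascade_cost * cascade_rate
                     + triage_cost * triage_rate
                     + sla_penalty * sla_breach_rate)                 # ~£531
    annual_reactive = catchable * 12 * weighted_cost
    converted = catchable * detection_efficacy                        # 351/month
    annual_saving = converted * 12 * (weighted_cost - drain_cost)
    net_benefit = annual_saving - year1_investment
    return {
        "weighted_cost_per_failure": weighted_cost,
        "annual_reactive_cost": annual_reactive,
        "annual_saving": annual_saving,
        "year1_net_benefit": net_benefit,
        "roi_pct": 100 * net_benefit / year1_investment,
        "payback_days": 365 * year1_investment / annual_saving,
    }

for name, value in roi_model().items():
    print(f"{name}: {value:,.0f}")
```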
The caveat
The SLA penalty assumption drives roughly 60% of the return. Without contractual SLA penalties:
Weighted cost per failure: £231
Annual saving: £846,612
ROI: 649%
Still a strong case — but the engineering overhead assumption now carries the weight, and that number needs validating against your actual incident data. That is what the Constraint Architecture Review produces.
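The no-SLA figures come out of the same sketch with one parameter zeroed; the small differences from the rounded numbers above are rounding only.

```python
# Re-run the illustrative model above without contractual SLA penalties.
no_sla = roi_model(sla_breach_rate=0.0)
print(f"{no_sla['weighted_cost_per_failure']:,.0f}")  # ~231
print(f"{no_sla['annual_saving']:,.0f}")              # ~846,000
print(f"{no_sla['roi_pct']:,.0f}")                    # ~648 (649% above uses pound-rounded inputs)
```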
Three numbers to get before any conversation
Before presenting any version of this model to your leadership, you need three figures from your own operations:
1. What does a pipeline incident cost your engineering team to resolve? Not the tooling cost — the people cost, including the post-mortem and interrupted sprint work.
2. Do you have contractual SLA obligations on AI-powered outputs, and what are the penalties per breach?
3. What percentage of your current failures produce downstream contamination before your orchestration layer fires?
If you cannot answer question 3, that is itself a diagnostic finding. It means you do not have visibility into your cascade exposure — which is precisely what a Calibration Engagement addresses.