The Compound Error Problem in Agent Workflows

Why 99% accuracy per step does not mean 99% workflow success.

The Math

In a multi-step agent workflow, errors compound exponentially. Each step introduces a small probability of failure, and those probabilities multiply across the entire chain. The formula is straightforward:

Success Rate = (accuracy per step) ^ (number of steps)

The implications are not intuitive. A system that is 99% accurate at each individual step — a level most teams would consider excellent — fails more often than it succeeds across a 100-step workflow.

Workflow Steps    Accuracy per Step    Success Rate
10                99%                  90.4%
50                99%                  60.5%
100               99%                  36.6%
100               95%                  0.59%
100               99.5%                60.6%

The key insight: reducing per-step error from 1% to 0.5% improves 100-step workflow success from 36.6% to 60.6%. Small improvements in per-step accuracy produce outsized gains in end-to-end reliability.
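The table above can be reproduced in a few lines. This is a minimal sketch (the function name is illustrative) that assumes each step fails independently with the same probability:

```python
def workflow_success_rate(step_accuracy: float, steps: int) -> float:
    """End-to-end success probability, assuming independent steps
    that each succeed with the same probability."""
    return step_accuracy ** steps

# Per-step accuracy of 99% across 100 steps:
print(f"{workflow_success_rate(0.99, 100):.1%}")   # 36.6%
# Halving the per-step error to 0.5%:
print(f"{workflow_success_rate(0.995, 100):.1%}")  # 60.6%
```

The independence assumption is optimistic in practice, since one step's error often raises the failure probability of the steps that consume its output.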

Sources of Error in Agent Workflows

Hallucination & Reasoning Error

The model generates plausible but incorrect outputs, fabricates facts, or follows flawed logical chains. These errors propagate downstream as subsequent steps treat them as ground truth.

Data Quality & Completeness

Missing fields, stale records, inconsistent formats, and incomplete context. The agent reasons correctly from incorrect or partial data, producing confident but wrong conclusions.

Context Loss in Delegation

When one agent delegates to another, critical context is lost in translation. The receiving agent operates with an incomplete picture, making decisions that would be different with full information.

Timing & Concurrency Errors

Race conditions, stale reads, and ordering dependencies. Parallel agents may act on data that another agent has already modified, creating inconsistent state.

Epistemic Drift

The agent's internal model of reality gradually diverges from actual reality. Assumptions valid at the start of a workflow may no longer hold by the time later steps execute.

Permission & Authorization Errors

Agents exceeding their granted authority, acting outside approved boundaries, or accumulating permissions across steps that were never intended to be combined.

Consensus Voting

The most effective mitigation for compound error is not building a single, more accurate agent. It is deploying multiple independent agents on the same input and using majority voting to determine the output.

The mathematics of consensus voting are powerful. When three independent agents each operate at 95% individual accuracy, the probability that a majority produces the correct answer rises to 99.28%. This is because two or more agents must fail simultaneously for the consensus to be wrong.

Scale this further: thirteen independent agents at 95% individual accuracy drive the consensus error rate to roughly 1 DPMO (defect per million opportunities), comfortably below the 3.4 DPMO threshold for Six Sigma quality. Even at the threshold rate of 3.4 defects per million per step, a 100-step workflow succeeds end-to-end 99.97% of the time.
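The majority-vote probability follows directly from the binomial distribution. A short sketch, assuming agent errors are fully independent (the function name is illustrative):

```python
from math import comb

def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent agents
    (n odd), each correct with probability p, gives the right answer."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

p3 = majority_correct(0.95, 3)            # 0.99275
dpmo_13 = (1 - majority_correct(0.95, 13)) * 1_000_000  # roughly 1 DPMO
```

The independence assumption matters: if the agents share a model, a prompt, or a data source, their errors correlate and the real gain is smaller than the binomial calculation suggests.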

“The compound error problem is not solved by making one agent smarter. It is solved by making the system more redundant.”

Reducing Error at the Source

Consensus voting reduces error propagation, but the most effective strategy is reducing error at each step before it compounds. The mitigations below are listed in order of reliability, from most deterministic to least.

Tools, APIs, and Code Execution

The most reliable way to eliminate agent error is to avoid asking the LLM to reason about something that can be computed deterministically. When an agent needs a calculation, a database lookup, a regulatory status check, or a data transformation, it should call a tool or API rather than generating the answer from its training data.

Tool use via Model Context Protocol (MCP) and direct API integration provides deterministic, verifiable results. A function that queries a database returns the correct balance every time. An LLM asked to estimate that balance will sometimes hallucinate. The governance implication is clear: prefer skills and executable code over generative reasoning wherever the task permits it.
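The pattern can be sketched as a simple tool registry: computable sub-tasks are routed to code, and only genuinely open-ended ones reach the model. All names here (`get_balance`, the in-memory data store, the registry) are hypothetical stand-ins, not any particular framework's API:

```python
def get_balance(account_id: str) -> float:
    """Deterministic lookup -- in practice a database query or API call."""
    accounts = {"acct-001": 1523.75}  # stand-in for the real data store
    return accounts[account_id]

# Registry mapping tool names to deterministic implementations.
TOOLS = {"get_balance": get_balance}

def handle_step(tool_name: str, **kwargs):
    """Route a computable sub-task to code instead of asking the model."""
    return TOOLS[tool_name](**kwargs)

print(handle_step("get_balance", account_id="acct-001"))  # 1523.75
```

The same query answered by a tool is correct every time; answered by generation, it is correct only probabilistically, and that probability is what compounds across the workflow.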

Grounded Search

For questions that require current information but do not have a deterministic API, grounded search provides a middle layer of reliability. The agent queries authoritative sources and grounds its response in retrieved evidence. This is more reliable than pure generation but less reliable than tool execution, because the agent must still interpret and synthesise the retrieved content.

Retrieval-Augmented Generation (RAG)

RAG provides the agent with relevant context from internal knowledge bases, reducing hallucination by anchoring generation in real documents. However, RAG introduces its own failure modes. LLMs occasionally ignore retrieved context, selectively attend to fragments that confirm a pre-existing pattern, or synthesise retrieved data incorrectly. The result can be a confident, well-sourced hallucination that is harder to detect than an unsourced one.

RAG is a valid mitigation that improves accuracy, but it should not be treated as a complete solution. Critical outputs grounded by RAG should still be validated against deterministic checks where possible.

FMOps (Foundation Model Operations)

Continuous prompt refinement, model selection, and performance monitoring ensure that per-step accuracy improves over time rather than degrading. DMAIC applied to prompt engineering provides the measurement discipline.

Automated Output Validation

Regardless of how an agent arrives at its output, a validation step that checks the result against known baselines, business rules, or computational verification catches errors before they propagate to the next step. This is the final quality gate: the agent produces a result, code validates it, and only verified outputs proceed.
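A validation gate can be as simple as a rule check that blocks unverified outputs from propagating. The rules and field names below are illustrative assumptions:

```python
def validate(output: dict) -> list[str]:
    """Check an agent's output against business rules; return violations."""
    errors = []
    if output.get("amount", 0) <= 0:
        errors.append("amount must be positive")
    if output.get("currency") not in {"USD", "EUR", "GBP"}:
        errors.append("unsupported currency")
    return errors

result = {"amount": 250.0, "currency": "USD"}  # hypothetical agent output
errors = validate(result)
if errors:
    # Block propagation: the next step never sees an unverified result.
    raise ValueError(f"blocked before propagation: {errors}")
```

Because the gate is deterministic code, its own failure rate does not compound the way generative steps do.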

The reliability hierarchy is clear: deterministic tools first, grounded search second, RAG-augmented generation third, pure generation last. The best agent architectures minimise the steps that depend on generative reasoning and maximise the steps that can be verified by code.

The best approach is layered: use tools and APIs for everything computable, ground search for current information, RAG for contextual knowledge, FMOps for continuous improvement, and validation gates before every consequential action. Layer consensus voting on top for the highest-risk decisions. This produces reliable systems without the cost of running consensus voting on every step.
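The layered routing decision can be sketched as a dispatch over the reliability hierarchy. The enum values and task flags are illustrative assumptions, not a fixed API:

```python
from enum import Enum, auto

class Strategy(Enum):
    TOOL = auto()      # deterministic computation or lookup
    SEARCH = auto()    # grounded search over authoritative sources
    RAG = auto()       # generation anchored in retrieved documents
    GENERATE = auto()  # pure generation: last resort

def choose_strategy(task: dict) -> Strategy:
    """Route each step to the most deterministic mechanism that fits."""
    if task.get("computable"):
        return Strategy.TOOL
    if task.get("needs_current_info"):
        return Strategy.SEARCH
    if task.get("has_internal_docs"):
        return Strategy.RAG
    return Strategy.GENERATE
```

Consensus voting then wraps only the GENERATE and RAG steps whose outcomes are consequential, keeping the redundancy cost proportional to the risk.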
