
Maker Consensus Pattern

  • Status: Proposal
  • Date: 2026-02-12
  • Author: Community Proposal
  • Scope: Addition of a new orchestration pattern to the pattern catalog, providing redundant execution with statistical error reduction via multi-agent voting.
  • Reference: Meyerson et al., “Solving a Million-Step LLM Task with Zero Errors” (arXiv:2511.09030, November 2025)


This proposal introduces Redundant Consensus as a potential Pattern 8 in the claude-team-orchestration pattern catalog. The pattern is inspired by the MAKER framework (Maximal Agentic decomposition, K-voting, Error Red-flagging), a research system that achieved zero errors across 1,048,575 sequential LLM steps on the Towers of Hanoi benchmark.

The core idea: decompose a task into atomic subtasks, execute each subtask with N independent agents, and use majority voting to select the correct answer. When per-step accuracy exceeds 50%, voting amplifies correctness toward statistical certainty. Red-flagging adds a self-check layer that rejects unreliable outputs before they enter the vote.

No existing pattern in the catalog provides this coordination topology — same task given to multiple agents whose outputs are compared for consensus. Parallel Specialists assign different lenses to the same input; Redundant Consensus assigns the same lens to the same input and uses statistical redundancy to catch errors.

This document presents the framework, maps it to Claude Code primitives, evaluates fit with equal weight to arguments for and against, and proposes a concrete implementation design for community consideration.


2. The MAKER Framework

MAKER was published by Meyerson et al. (Cognizant AI Labs) in November 2025. The paper demonstrated that a framework combining extreme decomposition with voting-based error correction could solve arbitrarily long sequential tasks with zero errors — specifically, a 20-disk Towers of Hanoi problem requiring 2^20 - 1 = 1,048,575 individual moves, each produced by a separate LLM call.

The key results:

  • Zero errors across all 1,048,575 steps
  • Log-linear cost scaling in step count (not exponential)
  • Model independence: even models with ~55–60% per-step accuracy achieve zero-error solutions through voting — no state-of-the-art reasoning model required
  • GPT-4.1-mini was sufficient as the voter model after cost optimization

Pillar 1: MAD (Maximal Agentic Decomposition)


Tasks are broken into the smallest possible units — ideally single-step subtasks where m = 1, with m denoting the number of steps handled per agent call. Each agent receives:

  • The overall solution strategy (provided explicitly, not discovered)
  • The current state
  • The prior action taken
  • An instruction to produce one next action and the resulting state

This separates strategic reasoning (what approach to take) from mechanical execution (what the next step is). The agent only handles execution; the strategy is pre-determined and injected into every prompt.

Why this matters: monolithic approaches where one agent generates an entire solution sequence (m=s) show exponential error accumulation. A single mistake at step 50 corrupts all subsequent steps. MAD isolates each step so errors cannot cascade.
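To make the shape of a MAD step concrete, here is an illustrative step prompt for the Towers of Hanoi setting (the move/next_state output format is from the paper; the exact wording and state encoding here are hypothetical):

```
Strategy (identical in every prompt): On odd-numbered moves, move the
smallest disk one peg to the right (wrapping around). On even-numbered
moves, make the only legal move that does not involve the smallest disk.

Current state: A = [3, 2], B = [1], C = []
Prior action: moved disk 1 from A to B

Produce exactly one next action and the resulting state:
move = <move>
next_state = <next_state>
```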

Pillar 2: K-Voting (First-to-Ahead-by-K)

Instead of fixed-count majority voting (always sample 3 or 5 times), MAKER uses an adaptive scheme: samples are drawn until one candidate achieves K more votes than any competitor. This adapts naturally — easy steps converge in 1–2 samples, hard steps draw more.

The winning probability for a correct answer:

p(correct win) = 1 / (1 + ((1-p)/p)^K)

Where p is the per-step accuracy. For the full task with s steps:

p_full = (1 + ((1-p)/p)^K)^(-s)

The minimum K grows logarithmically with step count: K_min = Θ(ln s). For 1M steps, roughly 18 voting rounds suffice for the hardest steps.
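These formulas are easy to sanity-check numerically. A minimal sketch (TypeScript; the values follow directly from the two equations above):

```typescript
// Per-step win probability and full-task success for ahead-by-K voting.
function pWin(p: number, K: number): number {
  return 1 / (1 + Math.pow((1 - p) / p, K));
}

function pFull(p: number, K: number, steps: number): number {
  return Math.pow(pWin(p, K), steps);
}

// Smallest K that achieves a target full-task success probability.
function minK(p: number, steps: number, target = 0.99): number {
  let K = 1;
  while (pFull(p, K, steps) < target) K++;
  return K;
}

console.log(pWin(0.8, 2));          // ≈ 0.941
console.log(pFull(0.8, 2, 100));    // ≈ 0.0023 — K=2 is far too low for 100 steps
console.log(minK(0.8, 1_048_575));  // 14 — K grows only logarithmically with step count
```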

Temperature control: The first vote uses temperature 0 (deterministic best guess). Subsequent votes use temperature 0.1 — enough diversity to decorrelate errors while staying close to the model’s best output.

Token budget: Maximum 750 output tokens per step. This constraint is both a cost control and a red-flagging signal (see Pillar 3).

Pillar 3: Red-Flagging

Before a response enters the vote tally, it passes through self-checks that reject unreliable outputs:

  1. Length check: Responses exceeding ~700 tokens show dramatically higher error rates (approaching 10% vs. typical 0.1%). Overly long outputs signal confused reasoning and are rejected.
  2. Format check: Responses must match expected output patterns (in the paper: move = <move> and next_state = <next_state>). Incorrect formatting signals confused reasoning.

Red-flagging reduces correlated errors more effectively than improving base accuracy. The valid response probability v adjusts the expected cost:

E[cost] = Θ(cs × ln s / (vp))

Where c is the per-token cost, s is step count, v is valid response rate, and p is accuracy among valid responses.

Cost Scaling

The critical property: with maximal decomposition (m=1), expected cost grows log-linearly in step count:

E[cost] = Θ(p⁻¹ × cs × ln s)

This contrasts sharply with coarser decompositions (m > 1), which show exponential cost growth in m. The log factor comes from K_min = Θ(ln s) — voting needs logarithmically more rounds as tasks get longer, not linearly more.

For context: doubling the number of subtasks adds roughly one more voting round to the hardest steps (K_min grows by ln 2), so total cost grows only slightly faster than linearly in s — a far cry from the exponential blowup of coarse decompositions.


3. Gap Analysis — What’s Missing from Existing Patterns

| # | Pattern | Coordination Topology | Agent Relationship | Output Combination |
|---|---------|----------------------|----------------------|--------------------|
| 1 | Parallel Specialists | Fan-out / fan-in | Different focus, same input | Synthesize diverse perspectives |
| 2 | Pipeline | Linear chain | Sequential handoff | Each stage transforms prior output |
| 3 | Swarm | Work pool | Same role, different inputs | Independent results, no combination |
| 4 | Research + Implementation | Two-phase | Research feeds implementation | Knowledge transfer |
| 5 | Plan Approval | Gate | Worker + approver | Approval / rejection |
| 6 | Multi-File Refactoring | Fan-out with fan-in | Same role, different files | Integration testing |
| 7 | RLM | Fan-out / fan-in | Same role, different chunks | Synthesis of chunk findings |

None of the seven patterns provide: same task, same focus, N agents, consensus voting.

The closest pattern is Parallel Specialists (#1), but it assigns different lenses to the same input — a security reviewer, a performance reviewer, and an architecture reviewer each look at the same code but analyze different aspects. Their outputs are complementary and get synthesized.

Redundant Consensus assigns the same lens to the same input — three agents all answer “Is this function vulnerable to SQL injection?” independently. Their outputs are redundant and get voted on.

| Dimension | Parallel Specialists | Redundant Consensus |
|-----------|----------------------|---------------------|
| Agent prompts | Different (each has unique focus) | Identical (same task) |
| Outputs | Complementary perspectives | Redundant answers |
| Combination method | Synthesis (merge all) | Voting (pick majority) |
| Goal | Coverage (find everything) | Accuracy (get it right) |
| Error model | Each catches what others miss | Errors decorrelate across agents |
| Failure mode | Missing a perspective | Correlated errors defeat voting |

Swarm (#3) is also distinct — it assigns the same role to different inputs. Three swarm workers process three different files. Redundant Consensus gives the same input to three workers and compares their answers.

The gap is real but narrow. Most software engineering tasks are creative — there is no single correct answer to “refactor this function” or “write a test for this endpoint.” Voting requires tasks with verifiable or at least comparable outputs. The gap exists specifically for:

  • Binary classification tasks (vulnerable / not vulnerable)
  • Deterministic transformations (callback → async/await)
  • Exact-match outputs (configuration validation: compliant / non-compliant)
  • Structured answers with enumerable options

For tasks outside this scope, the existing patterns are better fits.


4.1 Catalog Completeness

The pattern catalog should be complete — covering the space of useful coordination topologies. Redundant Consensus fills a real gap: the case where you want high confidence in a single answer rather than diverse perspectives or parallel throughput.

4.2 Concrete Software Engineering Use Cases

Security Vulnerability Classification

```
For each function in src/:
- 3 voters independently answer: "Is this function vulnerable to injection? YES/NO"
- Majority wins
- Red-flag: any voter that produces >200 tokens of reasoning (likely confused)
```

This is the strongest use case. Binary classification has exact-match voting, no semantic comparison needed. A single reviewer might miss a subtle vulnerability or hallucinate one; three independent reviews with majority voting reduce both false positives and false negatives.
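A voter prompt for this use case could look like the following sketch (the file path and exact wording are illustrative, not part of any existing agent definition):

```
You are one of several independent reviewers. Do not assume any other
review exists; form your own judgment.

File: src/db/queries.py (hypothetical), lines 40-62

Question: Is this function vulnerable to SQL injection? Answer YES or NO.
If the code alone does not allow a decision, answer UNCERTAIN.

Respond with JSON only:
{ "answer": "YES|NO|UNCERTAIN", "confidence": "HIGH|MEDIUM|LOW",
  "reasoning": "<one sentence>" }
```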

Test Generation Verification (Generate-then-Verify)

```
Step 1: One agent generates a test suite
Step 2: 3 voters independently answer: "Does this test correctly validate the specification? YES/NO"
Step 3: Majority determines whether to accept or regenerate
```

A variant where voting validates rather than generates. The creative work (test writing) stays with one agent; the mechanical judgment (does this test do what it claims?) gets voted on.

Migration Validation (Exact-Match Voting)

```
For each file requiring callback→async/await conversion:
- 3 voters independently produce the converted code
- Exact-match comparison (after whitespace normalization)
- Any disagreement → flag for human review
```

Mechanical code transformations have deterministic correct outputs. Three agents should produce identical conversions; disagreement signals a genuinely ambiguous case worth human attention.
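Exact-match voting presupposes a normalization step. A minimal sketch of whitespace-only normalization (anything beyond this — renamed variables, reordered statements — still registers as disagreement, which is the intended conservative behavior):

```typescript
// Normalize generated code so formatting differences don't break
// exact-match voting. Whitespace only: semantically equivalent but
// textually different outputs still count as disagreement.
function normalize(code: string): string {
  return code
    .split("\n")
    .map((line) => line.trim())             // strip indentation differences
    .filter((line) => line.length > 0)      // drop blank lines
    .join("\n");
}

function allAgree(outputs: string[]): boolean {
  const normalized = outputs.map(normalize);
  return normalized.every((o) => o === normalized[0]);
}

// Usage: allAgree([voter1Code, voter2Code, voter3Code])
//   true  → accept the conversion
//   false → flag for human review
```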

Configuration Compliance Checking

```
For each configuration entry:
- 3 voters check: "Does this comply with policy X? PASS/FAIL"
- Majority determines compliance status
- Disputed entries escalated to human reviewer
```

Policy compliance checks are binary and well-defined. Voting reduces the chance of a single agent misinterpreting a policy rule.

Deprecated API Migration

```
For each function using deprecated API pattern:
- 3 voters independently produce the updated function
- Exact-match vote (normalized)
- Unanimous agreement → auto-apply
- Split vote → manual review
```

When the transformation is mechanical (rename, signature change, pattern substitution), redundancy catches the occasional hallucination or omission.

4.3 Cost Analysis

Haiku as the voter model makes redundancy cheap:

| Scenario | Subtasks | Voters | Model | Est. Cost |
|----------|----------|--------|-------|-----------|
| Security audit (50 functions) | 50 | 3 | Haiku | ~$0.15 |
| Compliance check (100 policies) | 100 | 3 | Haiku | ~$0.25 |
| Migration validation (30 files) | 30 | 3 | Haiku | ~$0.10 |

The cost multiplier (3x per subtask) is offset by using the cheapest model. For comparison, a single Sonnet pass over the same 50 functions costs ~$0.50 — three Haiku voters cost less.

4.4 Statistical Foundation

The mathematical foundation is sound: if each voter independently gets the right answer with probability p > 0.5, majority voting amplifies correctness exponentially in the number of voters. With 3 voters and p = 0.8:

  • Single agent accuracy: 80%
  • 3-voter majority accuracy: 89.6%
  • 5-voter majority accuracy: 94.2%

With ahead-by-K voting (K=2) and p = 0.8:

  • Correct-win probability per step: 94.1%
  • Over 100 steps, K=2 alone is not enough (0.941^100 ≈ 0.2% full-task success); sustaining ~99.8% across all 100 steps requires raising K to about 8

The paper proves this scales to millions of steps with appropriate K.
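The majority figures above come from straightforward binomial arithmetic; a short sketch for reproducing them:

```typescript
// Probability that a majority of n independent voters (odd n), each
// correct with probability p, produces the correct answer.
function majorityAccuracy(p: number, n: number): number {
  const need = Math.floor(n / 2) + 1;
  let total = 0;
  for (let k = need; k <= n; k++) {
    total += binomial(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
  }
  return total;
}

function binomial(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return result;
}

console.log(majorityAccuracy(0.8, 3)); // ≈ 0.896
console.log(majorityAccuracy(0.8, 5)); // ≈ 0.942
```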


5.1 MAKER Was Designed for Mechanical Tasks


The Towers of Hanoi benchmark is maximally mechanical: every step has exactly one correct answer, the state representation is compact and unambiguous, and the strategy is provided in advance. Most software engineering tasks are creative, subjective, or context-dependent:

  • “Refactor this function” has many valid outputs
  • “Write a descriptive variable name” is subjective
  • “Design an API endpoint” involves tradeoffs

Voting only works when there is a consensus-eligible answer — something that can be meaningfully compared across agents. This restricts the pattern to a narrow subset of SE tasks.

Severity: High. This is the fundamental limitation. The pattern is genuinely useful only for binary/deterministic/enumerable tasks.

5.2 Output Comparison Requires Structured Formats

For non-trivial code outputs, comparing agent answers requires semantic equivalence checking. Two functions that do the same thing may have different variable names, control flow, or formatting. The paper avoided this entirely by using a structured output format with exact-match comparison.

In software engineering, exact-match comparison works for:

  • Binary YES/NO answers
  • Numeric values
  • Enum selections
  • Single-line outputs

It does not work for:

  • Multi-line code generation
  • Free-text explanations
  • Architectural recommendations

Severity: High. This limits practical applicability to tasks with constrained output formats.

5.3 No Native Voting Primitives in Claude Code


Claude Code provides TeamCreate, TaskCreate, TaskUpdate, TaskGet, and SendMessage. None of these implement voting, consensus, or output comparison. The entire voting mechanism must be built from:

  • Multiple identical tasks (voter agents)
  • Message collection at the lead/judge
  • Custom comparison logic in the judge’s prompt
  • Consensus determination via prompt engineering

This works but is fragile — the judge must correctly parse, compare, and tally heterogeneous outputs. There is no programmatic guarantee of correct vote counting.

Severity: Medium. Prompt-based voting works for simple output formats (YES/NO, PASS/FAIL) but becomes unreliable for complex outputs.

5.4 3x+ Cost Multiplier

Every subtask runs N times (typically 3). Even with Haiku, this is a 3x cost multiplier on agent compute. For large task decompositions, this can become significant:

| Subtasks | Single pass (Haiku) | 3-voter (Haiku) | 3-voter (Sonnet) |
|----------|---------------------|------------------|-------------------|
| 10 | ~$0.01 | ~$0.03 | ~$0.15 |
| 100 | ~$0.05 | ~$0.15 | ~$1.50 |
| 1,000 | ~$0.50 | ~$1.50 | ~$15.00 |

The cost is justified only when the accuracy improvement matters more than the cost increase. For many SE tasks, a single careful pass is sufficient.

Severity: Medium. Manageable with Haiku voters, but users must consciously decide the accuracy/cost tradeoff.

5.5 Correlated Errors Defeat Voting

The voting argument assumes independent errors — each agent fails independently with probability 1-p. In practice, LLM errors are often correlated:

  • Systematic model biases: If Haiku consistently misclassifies a certain code pattern, all three voters will get it wrong. Voting amplifies correctness only for random errors, not systematic ones.
  • Shared training data: All voters share the same training distribution. They are likely to share the same blind spots.
  • Identical prompts: The paper used temperature variation (0 then 0.1) to decorrelate, but identical prompts still bias toward the same failure modes.

Mitigation strategies (varied prompts, different few-shot examples, temperature variation) help but cannot eliminate correlation entirely.
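A toy model makes the impact concrete. Assume a fraction q of subtasks hit a shared blind spot where all voters fail together, and voters err independently on the rest (this model is illustrative, not from the paper):

```typescript
// Majority accuracy of 3 voters when a fraction q of subtasks hit a
// shared blind spot (all voters wrong together). Voting only helps on
// the (1 - q) independent-error portion.
function majority3(p: number): number {
  return 3 * p * p * (1 - p) + p ** 3;
}

function withBlindSpots(p: number, q: number) {
  return {
    singleVoter: (1 - q) * p,
    threeVoterMajority: (1 - q) * majority3(p),
  };
}

console.log(withBlindSpots(0.8, 0.0));
// { singleVoter: 0.8, threeVoterMajority: 0.896 } — full voting gain
console.log(withBlindSpots(0.8, 0.1));
// { singleVoter: 0.72, threeVoterMajority: ≈0.806 } — the 10% of
// systematically missed cases cap the benefit; no N can recover them
```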

Severity: High. This is the strongest theoretical objection. The paper acknowledges this and addresses it for Towers of Hanoi specifically, but generalizing to SE tasks requires empirical validation.

5.6 The Judge Is a Sequential Bottleneck

The judge must collect all votes, parse all outputs, compare them, and determine consensus — for every subtask. This creates a sequential bottleneck:

  • Judge context grows linearly with the number of subtasks × voters
  • For large task decompositions, the judge risks context exhaustion
  • The judge itself can make errors in vote tallying

Severity: Medium. Mitigated by batching and the lead-managed variant for small task counts, but becomes a real constraint at scale.

5.7 Flat Team Structure

Claude Code teams are flat — all agents exist in one team with one task list. MAKER’s architecture (orchestrator → decomposer → voter pools) maps to a flat structure where the team lead or a judge agent manages both decomposition and voting. This conflates concerns and makes the team lead/judge a bottleneck.

Severity: Low. Workable with the two-variant design (lead-managed for small, judge-mediated for large), but less clean than a hierarchical implementation would be.

5.8 Narrow Applicability

The existing seven patterns are all broadly applicable. Every engineering team will use Swarm, Pipeline, or Parallel Specialists regularly. Redundant Consensus is genuinely useful only for a narrow class of tasks. Including it in the catalog at equal prominence risks:

  • Users applying it to inappropriate tasks (voting on creative work)
  • Inflating the pattern count without proportional utility
  • Confusing users choosing between patterns

Severity: Medium. Mitigated by clear “when to use” and “when NOT to use” guidance, but the risk of misapplication is real.


Pillar 1: MAD → TaskCreate with Fine-Grained Decomposition


The team lead decomposes the overall task into atomic subtasks using TaskCreate. Each task represents one unit of work to be voted on.

```javascript
// Decompose "audit 50 functions for SQL injection" into 50 tasks
for (const func of functions) {
  TaskCreate({
    subject: `Vote: Is ${func.name} vulnerable to SQL injection?`,
    description: `
      File: ${func.file}
      Lines: ${func.startLine}-${func.endLine}
      Question: Is this function vulnerable to SQL injection? Answer YES or NO.
      Include confidence (HIGH/MEDIUM/LOW) and one-line reasoning.
    `,
    activeForm: `Voting on ${func.name}`
  })
}
```

Granularity is configurable — for binary classification, each function is one subtask. For migration validation, each file is one subtask.

Pillar 2: Voting → N Identical Tasks + Collection via SendMessage


For each subtask, N voter agents independently process the same input. Votes are collected by the team lead or a dedicated judge.

Lead-managed approach (for <10 subtasks):

```javascript
// Spawn 3 voters for each subtask — voters send results via SendMessage
Task({
  team_name: "maker-team",
  name: "voter-1",
  subagent_type: "swarm:maker-voter",
  model: "haiku",
  prompt: `Vote on task: "${taskDescription}"`,
  run_in_background: true
})
// ... repeat for voter-2, voter-3
// Lead collects 3 messages, tallies inline
```

Judge-mediated approach (for 10+ subtasks):

```javascript
// Persistent judge collects votes and tallies
Task({
  team_name: "maker-team",
  name: "judge",
  subagent_type: "swarm:maker-judge",
  model: "haiku",
  prompt: `Collect votes from voters. For each subtask, determine consensus...`,
  run_in_background: true
})
```

Pillar 3: Red-Flagging → Prompt-Level Self-Check + Judge-Level Filtering


Red-flagging operates at two levels:

  1. Voter self-check (prompt-level): The voter prompt includes instructions to flag uncertainty:

     If you are uncertain or your reasoning exceeds 3 sentences, set confidence to LOW.
     If the question is ambiguous or unanswerable from the given context, answer UNCERTAIN.

  2. Judge-level filtering: The judge rejects votes with LOW confidence or UNCERTAIN answers before tallying.

  3. Optional hooks (implementation-level): A Claude Code hook could validate voter output format before it reaches the judge:

```json
{
  "event": "on_message",
  "script": "validate-voter-output.sh",
  "pattern": "maker-voter"
}
```
The full mapping from MAKER concepts to Claude Code primitives:

| MAKER Concept | Claude Code Primitive | Notes |
|---------------|------------------------|-------|
| Task decomposition | TaskCreate | One task per votable subtask |
| Voter agent | Task with swarm:maker-voter | Haiku model, identical prompt |
| Vote collection | SendMessage to judge/lead | Structured JSON output |
| Vote tallying | Judge agent prompt logic | Parse and compare votes |
| Red-flag: length | Voter prompt: “keep response under N tokens” | Prompt-enforced |
| Red-flag: format | Judge parses JSON, rejects malformed | Judge-enforced |
| Red-flag: confidence | Voter self-reports confidence level | Prompt-enforced |
| Ahead-by-K | Judge tracks vote counts per answer | Prompt logic |
| Temperature variation | Not directly controllable | Use prompt variation instead |
| State passing | Task description carries context | Via TaskCreate description |
| Result aggregation | Judge sends final report to lead | Via SendMessage |
Where the implementation diverges from the paper:

| Paper | Claude Code Implementation |
|-------|----------------------------|
| Temperature 0 for first vote, 0.1 for subsequent | Temperature not controllable per-call; use prompt variation for decorrelation |
| Programmatic vote parsing and tallying | Prompt-based parsing by judge agent |
| Exact-match string comparison | Judge compares structured JSON fields |
| Adaptive sampling (draw until ahead-by-K) | Fixed N voters, or judge requests additional voters via message |
| Token-count red-flagging | Prompt-based length guidance; no programmatic token counting |

  • Name: Redundant Consensus (MAKER-Inspired)
  • Number: 8
  • When to use: Binary classification, deterministic validation, compliance checks — tasks with verifiable correct answers where high confidence matters more than speed or cost
  • When NOT to use: Creative tasks, subjective judgment, code generation, any task where “correct” is ambiguous

Lead-Managed Variant (Small Task Counts)

The team lead spawns voters directly and tallies inline. No dedicated judge needed.

```
User prompt:
"Check these 5 functions for SQL injection vulnerability using
redundant consensus with 3 voters each."

Team Lead behavior:
1. Create team
2. For each function (5 subtasks):
   a. Spawn 3 voter agents (Haiku) with identical prompts
   b. Collect 3 responses via incoming messages
   c. Parse JSON from each voter
   d. Apply red-flag filter (reject LOW confidence, malformed JSON)
   e. Tally valid votes:
      - If one answer leads by ≥2 among valid votes → accept as consensus
      - If tied or insufficient valid votes → mark as DISPUTED
   f. Record result for this subtask
3. Compile all results into final report
4. Shutdown voters, delete team
```

Pseudocode for lead-managed consensus:

```
function processSubtask(subtask, voterCount = 3):
  votes = []
  for i in 1..voterCount:
    spawn voter-{i} as swarm:maker-voter (Haiku, background)
      prompt: formatVoterPrompt(subtask)

  // Collect responses
  for i in 1..voterCount:
    response = awaitMessage(from: voter-{i})
    parsed = parseVoterJSON(response)
    if parsed.confidence != "LOW" and parsed.answer != "UNCERTAIN":
      votes.append(parsed)

  // Tally
  if votes.length < 2:
    return { subtask, result: "DISPUTED", reason: "insufficient valid votes" }
  counts = countBy(votes, v => v.answer)
  sorted = sortDescending(counts)
  leader = sorted[0]
  runnerUp = sorted.length > 1 ? sorted[1] : { count: 0 }
  if leader.count - runnerUp.count >= 2:  // ahead-by-K where K=2
    return { subtask, result: leader.answer, confidence: "CONSENSUS" }
  else if leader.count > votes.length / 2:
    return { subtask, result: leader.answer, confidence: "MAJORITY" }
  else:
    return { subtask, result: "DISPUTED", votes: counts }
```

Judge-Mediated Variant (Large Task Counts)

A persistent judge agent manages vote collection, freeing the team lead from per-vote coordination.

```
Team structure:
- team-lead: Decomposes task, creates subtasks, spawns judge + voters
- judge: Persistent agent that claims voting tasks, assigns them to voters, tallies
- voter-1, voter-2, voter-3: Identical Haiku agents, process tasks from pool

Team Lead behavior:
1. Create team
2. Create one task per subtask (all pending)
3. Spawn judge agent (Haiku)
4. Spawn 3 voter agents (Haiku)
5. Judge claims subtasks, assigns to voters via SendMessage
6. Judge collects votes, tallies, records results to task descriptions
7. Judge sends consolidated report to team-lead when all subtasks done
8. Team lead presents report, shuts down team

Judge behavior:
1. Check TaskList for pending subtasks
2. For each pending subtask:
   a. Send the subtask prompt to all 3 voters via SendMessage
   b. Collect 3 responses
   c. Apply red-flag filters
   d. Tally votes, determine consensus
   e. Write result to task description via TaskUpdate
   f. Mark task completed
3. After all tasks: send consolidated report to team-lead
```

Voter Output Schema

All voters produce structured JSON:

```json
{
  "answer": "YES",
  "confidence": "HIGH",
  "reasoning": "The function concatenates user input directly into the SQL query at line 42 without parameterization."
}
```

Field constraints:

  • answer: One of the expected values for this subtask type (YES/NO, PASS/FAIL, or the specific transformation output)
  • confidence: HIGH, MEDIUM, or LOW
  • reasoning: One sentence maximum. Longer reasoning triggers the length red-flag.
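For contrast, a hypothetical vote that the red-flag filters would discard before tallying:

```json
{
  "answer": "UNCERTAIN",
  "confidence": "LOW",
  "reasoning": "The query is built dynamically, but the sanitize_input() helper it calls is not shown, so injection cannot be ruled out from this code alone."
}
```

This vote fails two filters (UNCERTAIN answer, LOW confidence) and never enters the vote count.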
The judge (or the lead, in the lead-managed variant) determines consensus with the following logic:

```
function determineConsensus(votes, K = 2):
  // Step 1: Red-flag filter
  validVotes = votes.filter(v =>
    v.confidence != "LOW" and
    v.answer != "UNCERTAIN" and
    v.reasoning.sentenceCount <= 3
  )

  // Step 2: Insufficient valid votes
  if validVotes.length < 2:
    return DISPUTED("insufficient valid votes after red-flag filtering")

  // Step 3: Count by answer
  counts = groupAndCount(validVotes, by: "answer")
  sorted = sortDescending(counts)
  leader = sorted[0]
  runnerUp = sorted.length > 1 ? sorted[1] : { count: 0 }

  // Step 4: Ahead-by-K fast path
  if leader.count - runnerUp.count >= K:
    return CONSENSUS(leader.answer)

  // Step 5: Simple majority fallback
  if leader.count > validVotes.length / 2:
    return MAJORITY(leader.answer)

  // Step 6: No consensus
  return DISPUTED("split vote", details: counts)
```

Since Claude Code does not expose temperature control per API call, decorrelation must come from prompt-level variation:

| Strategy | Implementation | Effectiveness |
|----------|----------------|---------------|
| Varied system prompts | Each voter gets a slightly different framing (“You are a security expert”, “You are a code auditor”, “You are a penetration tester”) | Medium — introduces perspective diversity but may bias answers |
| Different few-shot examples | Each voter sees different example classifications | Medium — helps with pattern variety |
| Reordered context | Present the code snippet with different surrounding context | Low — minor effect on classification tasks |
| Role rotation | Voter 1 argues YES, Voter 2 argues NO, Voter 3 gives honest answer | High for adversarial tasks, not suitable for classification |

Recommended default: Use identical prompts. The simplicity benefit outweighs the marginal decorrelation from prompt variation, and identical prompts make vote comparison straightforward. Reserve prompt variation for cases where empirical testing shows correlated failures.

Strategies for controlling the cost multiplier:

| Strategy | Mechanism | Tradeoff |
|----------|-----------|----------|
| Haiku voters | Use cheapest model for voters | Lower per-vote cost; slightly lower per-step accuracy |
| Selective voting | Only vote on ambiguous subtasks; single-pass for obvious ones | Reduces total votes but requires a pre-classification step |
| Adaptive N | Start with N=3; increase to N=5 only for DISPUTED results | Reduces average cost; adds latency for disputed items |
| Batch processing | Process all subtasks before voting (generate answers, then compare) | Better parallelism; requires holding all outputs in context |
| Early termination | Stop voting when ahead-by-K is reached (don’t wait for all N) | Cannot implement cleanly with background agents; requires sequential spawning |
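As one example, the “Adaptive N” row above could take roughly this shape (a sketch; spawnVoters is a hypothetical helper that spawns n voters and collects their parsed votes, and determineConsensus is the function sketched earlier):

```typescript
// Hypothetical helper: spawns n voter agents for the subtask and returns
// their parsed Vote objects (see the Vote/Outcome types sketched above).
declare function spawnVoters(subtask: string, n: number): Promise<Vote[]>;

// Start with 3 voters; escalate to 5 only when the first round is DISPUTED.
async function adaptiveConsensus(subtask: string): Promise<Outcome> {
  const firstRound = await spawnVoters(subtask, 3);
  const outcome = determineConsensus(firstRound);
  if (outcome.result !== "DISPUTED") return outcome;

  // Two more votes, then re-tally across all five.
  const extra = await spawnVoters(subtask, 2);
  return determineConsensus([...firstRound, ...extra]);
}
```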

These agents would be created if the pattern is accepted. Definitions here are specifications only — no files are created by this proposal.

File: agents/maker-voter.md

```yaml
---
name: maker-voter
description: Independent voter agent for MAKER-inspired redundant consensus. Analyzes a single subtask and returns a structured JSON vote with answer, confidence, and reasoning. Includes red-flag self-checks.
model: haiku
tools:
  - Read
  - Grep
  - Glob
  - SendMessage
  - TaskList
  - TaskGet
  - TaskUpdate
color: yellow
---
```

Behavioral description:

```markdown
# MAKER Voter Agent

You are an independent voter in a redundant consensus workflow. Your job is to
analyze a single subtask and return a structured vote.

## Critical Rules

1. **Independence**: Do not coordinate with other voters. Your answer must be
   your own independent judgment.
2. **Structured output**: Always return valid JSON matching the schema below.
3. **Brevity**: Keep reasoning to ONE sentence. If you need more than one
   sentence, your confidence should be LOW.
4. **Honesty**: If you are uncertain, say UNCERTAIN. Do not guess.

## Output Schema

    {
      "answer": "<YES|NO|PASS|FAIL|or task-specific value>",
      "confidence": "<HIGH|MEDIUM|LOW>",
      "reasoning": "<one sentence explaining your answer>"
    }

## Red-Flag Self-Checks

Before submitting your vote, verify:
- [ ] Your reasoning is ONE sentence (not two or more)
- [ ] Your answer matches one of the expected values
- [ ] You have actually read the relevant code/content
- [ ] If uncertain, confidence is LOW or answer is UNCERTAIN

## Team Workflow

1. Check TaskList for pending voting tasks
2. Claim a task with TaskUpdate (set owner to your name, status to in_progress)
3. Read the relevant file/content specified in the task description
4. Form your independent judgment
5. Send your JSON vote to the judge or team-lead via SendMessage
6. Mark task completed via TaskUpdate
7. Check TaskList for more work — repeat until no pending tasks remain
```

File: agents/maker-judge.md

```yaml
---
name: maker-judge
description: Vote collection and consensus determination agent for MAKER-inspired redundant consensus. Collects votes from voter agents, applies red-flag filtering, tallies results, and reports consensus or disputes.
model: haiku
tools:
  - Read
  - Grep
  - Glob
  - SendMessage
  - TaskList
  - TaskGet
  - TaskUpdate
color: magenta
---
```

Behavioral description:

```markdown
# MAKER Judge Agent

You are the judge in a redundant consensus workflow. Your job is to collect
votes from voter agents, filter unreliable votes, determine consensus, and
report results.

## Workflow

For each subtask:
1. Receive votes from voter agents (via incoming messages or task descriptions)
2. Parse each vote's JSON: { answer, confidence, reasoning }
3. Apply red-flag filters:
   - REJECT votes with confidence: "LOW"
   - REJECT votes with answer: "UNCERTAIN"
   - REJECT votes with reasoning longer than 3 sentences
   - REJECT votes with malformed JSON
4. Tally valid votes by answer value
5. Determine consensus:
   - CONSENSUS: Leading answer ahead by ≥K (default K=2) → high confidence
   - MAJORITY: Leading answer has >50% of valid votes → moderate confidence
   - DISPUTED: No clear winner → flag for review
6. Record result in the task description via TaskUpdate

## Result Format

For each subtask, record:

    {
      "subtask": "<subtask identifier>",
      "result": "<CONSENSUS|MAJORITY|DISPUTED>",
      "answer": "<winning answer or null if DISPUTED>",
      "vote_counts": { "YES": 2, "NO": 1 },
      "filtered_count": 0,
      "filter_reasons": []
    }

## Final Report

After all subtasks are processed, compile a consolidated report:

## Consensus Results

| Subtask | Result    | Answer | Votes | Filtered |
|---------|-----------|--------|-------|----------|
| func_a  | CONSENSUS | NO     | 3/3   | 0        |
| func_b  | CONSENSUS | YES    | 2/3   | 0        |
| func_c  | DISPUTED  | —      | 1Y/1N | 1        |

### Summary
- Total subtasks: X
- Consensus reached: Y (Z%)
- Disputed: W (require manual review)

Send this report to team-lead via SendMessage.

## Guidelines

- Never cast your own vote — you are an impartial judge
- Never modify or reinterpret voter answers
- Apply red-flag filters consistently
- Report disputed items prominently — they need human attention
```

How Redundant Consensus compares to the most similar existing patterns:

| Dimension | Redundant Consensus | Parallel Specialists | Swarm | RLM |
|-----------|---------------------|----------------------|-------|-----|
| Task type | Binary/deterministic | Multi-perspective analysis | Many similar independent | Large file analysis |
| Agent prompts | Identical | Different focus per agent | Identical | Identical (per content type) |
| Input per agent | Same input | Same input | Different input each | Different chunk each |
| Output combination | Voting (pick majority) | Synthesis (merge all) | None (independent) | Synthesis (merge all) |
| Goal | High accuracy on one answer | Comprehensive coverage | Throughput | Comprehensive analysis |
| Optimal agent count | 3–5 (odd numbers) | 2–5 (one per focus) | 2–5 (load-dependent) | 3–6 (chunk-dependent) |
| Model recommendation | Haiku (voters), Haiku (judge) | Varies by focus | Varies by task | Haiku (analysts), Sonnet (synthesizer) |
| Cost profile | N × per subtask × count | Fixed (one per focus) | Fixed (one per item) | Fixed (one per chunk) |
| Error model | Random errors decorrelate | Different errors per lens | Independent | Independent per chunk |
| Failure mode | Correlated errors | Missing a perspective | Bottleneck on slow items | Chunk boundary effects |

Use this pattern selection guide:

```
Is the output verifiable (binary, deterministic, exact-match)?
  YES → Is high confidence worth 3x cost?
    YES → Redundant Consensus
    NO  → Single-pass Swarm
  NO → Do you need multiple perspectives?
    YES → Parallel Specialists
    NO → Is the input too large for one context?
      YES → RLM
      NO → Is work sequential?
        YES → Pipeline
        NO  → Swarm
```
Situations where the pattern should NOT be used:

| Situation | Why Not | Better Alternative |
|-----------|---------|--------------------|
| Code generation (write a function) | No verifiable “correct” output | Single agent or Pipeline |
| Code review (find all issues) | Goal is coverage, not consensus | Parallel Specialists |
| Refactoring (improve code quality) | Subjective, many valid outputs | Multi-File Refactoring |
| Architecture design | Creative, no single answer | Plan Approval |
| Large file analysis | Need comprehensive findings, not consensus | RLM |
| Many independent files | Need throughput, not accuracy per file | Swarm |
| Research tasks | Exploratory, no votable question | Research + Implementation |

10.1 Limitations Summary

| # | Limitation | Severity | Mitigation |
|---|------------|----------|------------|
| 1 | Only works for tasks with verifiable correct answers | High | Clear documentation of eligible task types; “when NOT to use” guidance |
| 2 | Correlated LLM errors defeat voting | High | Prompt variation, different few-shot examples, empirical validation before deployment |
| 3 | Output comparison requires structured formats | High | Constrain voter output to JSON schema; use binary answer fields for comparison |
| 4 | 3x+ cost multiplier per subtask | Medium | Haiku voters ($0.003/subtask vs $0.01 for Sonnet single-pass); selective voting for obvious cases |
| 5 | No programmatic voting primitives | Medium | Judge agent handles parsing/tallying via prompt; works for simple output formats |
| 6 | Judge is a sequential bottleneck | Medium | Batch subtask results; lead-managed variant for small counts; judge processes results as they arrive |
| 7 | No temperature control for decorrelation | Medium | Use prompt variation instead; accept that correlation may be higher than paper’s setup |
| 8 | Flat team structure (no nesting) | Low | Two-variant design accommodates; lead-managed for small, judge-mediated for large |
| 9 | Narrow applicability relative to other patterns | Medium | Position as a specialized tool, not a general-purpose pattern; clear selection guide |
| 10 | Judge can make errors in tallying | Low | Structured JSON makes parsing reliable; simple counting logic is robust |
| 11 | Cannot implement adaptive sampling (ahead-by-K with dynamic N) | Low | Use fixed N=3 with optional retry for DISPUTED; simpler and sufficient for most cases |

10.2 Trade-offs Relative to Each Existing Pattern

| vs. Pattern | Redundant Consensus Wins When… | Other Pattern Wins When… |
|-------------|--------------------------------|---------------------------|
| Parallel Specialists | You need one right answer, not many perspectives | You need comprehensive coverage from different angles |
| Pipeline | Steps are independent and votable | Steps depend on each other sequentially |
| Swarm | Same input needs high-confidence classification | Different inputs need throughput |
| Research + Implementation | Validation needs statistical confidence | Work requires creative discovery then execution |
| Plan Approval | Many binary decisions need automated checking | One complex decision needs human judgment |
| Multi-File Refactoring | Each file needs validated transformation | Files need coordinated creative changes |
| RLM | Input is small but answer must be reliable | Input is large and needs comprehensive analysis |

Files to be created or modified:

| File | Description | Dependencies |
|------|-------------|--------------|
| agents/maker-voter.md | Voter agent definition | None |
| agents/maker-judge.md | Judge agent definition | None |
| skills/maker-consensus/SKILL.md | User-invocable skill for the pattern | Agent definitions |
| docs/patterns.md (modify) | Add Pattern 8 entry | Skill definition |
| docs/agent-types.md (modify) | Add voter and judge to agent catalog | Agent definitions |
```
Phase 1: Agent Definitions (parallelizable)
├── Create agents/maker-voter.md
└── Create agents/maker-judge.md

Phase 2: Skill Definition
└── Create skills/maker-consensus/SKILL.md
    (depends on Phase 1)

Phase 3: Documentation Updates (parallelizable)
├── Update docs/patterns.md — add Pattern 8
└── Update docs/agent-types.md — add new agents
    (both depend on Phase 2)
```
Verification checklist:

| Check | Method |
|-------|--------|
| Agent definitions load correctly | Spawn each agent type via Task and verify they respond |
| Voter produces valid JSON | Send a test subtask and validate output schema |
| Judge correctly tallies votes | Provide 3 mock votes and verify consensus determination |
| Lead-managed variant works end-to-end | Run a 3-subtask binary classification with 3 voters each |
| Judge-mediated variant works end-to-end | Run a 10-subtask binary classification with persistent judge |
| Pattern docs are internally consistent | Cross-reference pattern entry, agent docs, and skill definition |
| “When NOT to use” guidance is clear | Review with someone unfamiliar with the pattern |

  1. Pattern numbering: Should this be Pattern 8, or should it be categorized differently (e.g., as a sub-pattern or variant of Parallel Specialists)?

  2. Default voter count: Should the default be 3 (minimum for majority vote) or 5 (more robust but costlier)? The paper uses adaptive sampling, but fixed-N is simpler for Claude Code.

  3. Judge model: Should the judge use Haiku (cheaper, sufficient for simple tallying) or Sonnet (more reliable parsing of complex outputs)?

  4. Scope of initial implementation: Should the first version support only binary YES/NO tasks (simplest case, strongest fit), or attempt to support arbitrary structured outputs from the start?

  5. Prompt variation for decorrelation: Should voters get identical prompts (simpler, cleaner comparison) or varied prompts (better decorrelation but harder comparison)? The paper used temperature variation, which is not available here.

  6. Naming: Is “Redundant Consensus” the right name? Alternatives considered:

    • “Voting Consensus”
    • “Multi-Agent Verification”
    • “Statistical Redundancy”
    • “Consensus Voting”
  7. Catalog placement: Given the narrow applicability (binary/deterministic tasks only), should this be a full pattern or documented as a “technique” or “recipe” at a lower tier than the core seven?

Alternatives considered:

| Alternative | Description | Why Not Recommended |
|-------------|-------------|---------------------|
| Built-in voting hook | A Claude Code hook that automatically runs N copies of any agent and votes | Over-engineers the solution; most tasks don’t benefit from voting; hooks can’t compare semantic output |
| Confidence-based single pass | One agent with a confidence score; retry if confidence is low | Doesn’t provide statistical guarantees; self-reported confidence is unreliable |
| Debate pattern (agents argue) | Two agents take opposing positions; a judge decides | Different topology (adversarial, not redundant); better for subjective questions than binary classification |
| Ensemble with different models | Use different models (Haiku, Sonnet, Opus) as voters | Model-level diversity is better for decorrelation but much more expensive; harder to compare outputs across capability levels |
| LLM-as-judge (single evaluator) | One agent generates, another evaluates | Two-agent variant of Plan Approval; no statistical amplification |

Appendix A: Worked Example — Security Audit


A complete walkthrough of the lead-managed variant applied to SQL injection classification.

```
Task: Audit 5 functions for SQL injection vulnerability
Voters: 3 per function (Haiku)
Consensus rule: Ahead-by-2
```

Team lead creates 5 tasks:

```
Task 1: "Is getUserById() in src/users.py:15-30 vulnerable to SQL injection? YES/NO"
Task 2: "Is searchProducts() in src/products.py:42-68 vulnerable to SQL injection? YES/NO"
Task 3: "Is updateOrder() in src/orders.py:101-125 vulnerable to SQL injection? YES/NO"
Task 4: "Is deleteComment() in src/comments.py:55-72 vulnerable to SQL injection? YES/NO"
Task 5: "Is getReport() in src/reports.py:200-240 vulnerable to SQL injection? YES/NO"
```

Three voters independently read src/users.py:15-30 and respond:

Voter 1:

```json
{ "answer": "YES", "confidence": "HIGH", "reasoning": "User input 'user_id' is concatenated directly into SQL string at line 22." }
```

Voter 2:

```json
{ "answer": "YES", "confidence": "HIGH", "reasoning": "String formatting used for SQL query construction with unsanitized parameter." }
```

Voter 3:

```json
{ "answer": "YES", "confidence": "MEDIUM", "reasoning": "The query uses f-string interpolation with the user_id parameter." }
```

Consensus determination for Task 1:

```
Red-flag filter: All votes pass (all HIGH or MEDIUM confidence, all valid JSON)
Vote tally: YES=3, NO=0
Ahead-by-K check: 3-0 = 3 ≥ K(2) → CONSENSUS
Result: YES (CONSENSUS, 3/3 votes)
```
The team lead's final report:

```markdown
## SQL Injection Audit Results

| Function | Result | Answer | Votes | Confidence |
|----------|--------|--------|-------|------------|
| getUserById()    | CONSENSUS | YES (vulnerable) | 3/3      | HIGH |
| searchProducts() | CONSENSUS | NO (safe)        | 3/3      | HIGH |
| updateOrder()    | CONSENSUS | YES (vulnerable) | 2/3      | HIGH |
| deleteComment()  | DISPUTED  | —                | 1Y/1N/1U | LOW  |
| getReport()      | CONSENSUS | NO (safe)        | 3/3      | HIGH |

### Summary
- 5 functions audited with 3-voter redundant consensus
- 4/5 reached consensus (80%)
- 1 disputed (deleteComment) — requires manual review
- 2 confirmed vulnerabilities found
```

Appendix B: Cost Comparison (50-Function Audit)

| Approach | Agents | Model | Est. Input Tokens | Est. Output Tokens | Est. Cost |
|----------|--------|-------|-------------------|--------------------|-----------|
| Single pass | 50 | Sonnet | 100K | 25K | ~$0.50 |
| Single pass | 50 | Haiku | 100K | 25K | ~$0.05 |
| 3-voter consensus | 150 | Haiku | 300K | 75K | ~$0.15 |
| 5-voter consensus | 250 | Haiku | 500K | 125K | ~$0.25 |
| 3-voter consensus | 150 | Sonnet | 300K | 75K | ~$1.50 |

Key insight: 3 Haiku voters cost less than 1 Sonnet pass while providing statistical error reduction. The cost argument only holds when Haiku is the voter model.


Appendix C: Relationship to Existing Research

| Aspect | Paper | This Proposal |
|--------|-------|---------------|
| Domain | Towers of Hanoi (mechanical puzzle) | Software engineering (binary/deterministic subtasks) |
| Model | GPT-4.1-mini | Claude Haiku (voters), Claude Haiku (judge) |
| Decomposition | Automated (one move per step) | Manual (team lead identifies subtasks) |
| Voting | Adaptive (first-to-ahead-by-K) | Fixed N with optional retry for disputes |
| Red-flagging | Token count + format validation | Confidence self-report + JSON format validation |
| Temperature | 0 (first), 0.1 (subsequent) | Not controllable; use prompt variation |
| Scale tested | 1,048,575 steps | Proposed for 10–500 subtask range |
| Error independence | Partially addressed via temperature | Partially addressed via prompt variation |
| Infrastructure | Custom orchestration code | Claude Code native primitives |

Related approaches:
  • LLM-as-Judge: Single evaluator, no statistical amplification. Complementary — could serve as a dispute resolution mechanism within Redundant Consensus.
  • Constitutional AI / RLAIF: Uses multiple models to evaluate outputs, but for alignment training, not runtime consensus.
  • Self-Consistency (Wang et al., 2023): Samples multiple reasoning paths from one model and takes majority vote. Similar principle but operates within a single agent, not across a team.
  • Debate (Irving et al., 2018): Adversarial agents argue opposing positions. Different topology — suited for subjective questions, not binary classification.