Documentation

How Debates Work

The full debate pipeline from question to synthesis.

Overview

DebateTalk runs multiple AI models against the same question simultaneously, has them deliberate with each other across rounds, and produces a synthesized answer that reflects the consensus they reach. The process is designed to surface disagreements early, pressure-test reasoning, and converge on answers that are more reliable than any single model would produce alone.

Every debate moves through three phases. Phase one is the blind round. Each debater model independently answers the question without seeing what any other model has written. This preserves the independence of their initial reasoning. Phase two is deliberation. Models see each other's anonymized responses and can revise their positions over one or more rounds. The debate continues until consensus is reached or the round limit for your plan is hit. Phase three is synthesis. A separate synthesizer model reads the full debate transcript and writes the final agreed answer.

At the end of each deliberation round, an adjudicator evaluates whether the debaters have reached a sufficient level of consensus. If they have, the debate ends immediately and synthesis begins. If the maximum number of rounds is reached without consensus, synthesis still runs to capture the best available answer from the state of the debate at that point. After synthesis, the adjudicator scores each debater for accuracy.

The debate pipeline is the same regardless of whether you use Auto mode or Manual mode. What changes between modes is how the models are selected, not how the debate itself runs.

All intermediate events are streamed to the client as Server-Sent Events. You receive round-by-round responses from each debater, consensus scores after each round, the synthesis when it is written, and accuracy scores at the end. This means you can observe the full reasoning process as it unfolds rather than waiting for a single final answer.
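
As a rough sketch of what consuming this stream looks like, the helper below parses raw SSE text into (event, payload) pairs. It assumes the standard `event:` / `data:` wire format with a single JSON `data:` line per event, matching the examples in this document; it is an illustration, not the platform's client library.

```python
import json

def parse_sse(raw: str):
    """Parse a raw Server-Sent Events stream into (event, payload) pairs.

    Assumes each event uses one `event:` line and one `data:` line carrying
    a JSON body, with events separated by blank lines.
    """
    events = []
    name, data = None, []
    for line in raw.splitlines() + [""]:
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and name is not None:
            # Blank line terminates the event: emit it and reset.
            events.append((name, json.loads("".join(data))))
            name, data = None, []
    return events
```

For example, a stream containing a consensus event followed by a final event parses into a two-element list, which a client can dispatch on by event name.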

Question Classification

Before a debate begins, the question is automatically classified into one of six types. Classification shapes the entire debate: it determines which models are selected in Auto mode, what termination trigger the adjudicator applies when evaluating consensus, and the framing that debaters use when reasoning about their answer. Understanding which type your question falls into helps you interpret the debate and the consensus scores.

Factual

Factual questions have objectively verifiable answers grounded in data, research, or established scientific or historical knowledge. The debate is looking for convergence on what is true. Example: "What are the proven health effects of caffeine?" Factual questions benefit most from models with strong knowledge recall and accuracy. Disagreements in factual debates often resolve quickly once models share their evidence.

Normative

Normative questions ask about ethics, values, policy, or what ought to be done. They do not have a single objectively correct answer. Instead, the debate is seeking alignment on a reasoned ethical framework and the conclusions it supports. Example: "Should AI-generated content be labeled?" Normative questions tend to produce the most substantive deliberation because different models may begin with genuinely different moral frameworks and must work through those differences explicitly.

Business

Business questions are strategic or operational in nature. They concern risk, market dynamics, competitive positioning, or organizational decisions. Example: "Should a Series A startup expand internationally in year one?" The debate is seeking agreement on the key risks, trade-offs, and a recommended course of action. Business questions often benefit from a mix of analytical and creative models because the best answers combine rigorous risk assessment with lateral thinking about second-order effects.

Prediction

Prediction questions are forward-looking. They ask about future outcomes, probabilities, or likelihoods. Example: "Will electric vehicles account for more than 50% of new car sales by 2035?" Consensus for prediction questions is measured by whether the models' confidence intervals and probability estimates converge. A debate where all models land within a similar probability range has reached consensus even if the exact numbers differ.

Brainstorm

Brainstorm questions are open-ended. They seek a breadth of ideas rather than a single correct answer. Example: "What are unconventional marketing strategies for a B2B SaaS product?" Brainstorm debates are structured differently from analytical ones. Consensus is not about agreement on a single answer but about whether the space of ideas has been sufficiently explored. The debate ends when models stop introducing distinct new ideas that others have not already acknowledged.

Belief

Belief questions sit at the intersection of personal, cultural, or philosophical worldviews. They are not purely factual or normative but touch on foundational commitments about the nature of reality, consciousness, or human experience. Example: "Does consciousness require a biological substrate?" These questions rarely reach full consensus. Instead, the debate seeks to identify shared underlying principles or frameworks even when surface-level conclusions differ, and the synthesis reflects the genuine state of that partial agreement.

In Auto mode, classification happens first, then model selection follows. In Manual mode, classification still happens and the result is passed to the debaters as context, but it does not affect which models participate because you have already chosen them.

Termination Triggers

Each question type has a corresponding termination trigger. The termination trigger is the definition of consensus that the adjudicator applies when evaluating whether the debate should end. It is not a binary check. It describes the specific shape of agreement the adjudicator is looking for given the nature of the question.

Understanding termination triggers matters because it explains why some debates end after a single deliberation round while others run to the maximum. A factual question where all models cite the same research and agree on the conclusions can satisfy Metric Convergence at the first consensus check. A belief question where models are working through fundamentally different philosophical frameworks may reach its round limit before the adjudicator is satisfied, and synthesis runs anyway.

Metric Convergence (Factual)

The adjudicator evaluates whether the models agree on the core facts, their relative importance, and the conclusions those facts support. Minor differences in emphasis or phrasing do not prevent termination. What matters is that no model is asserting a factually contradictory claim that others have not acknowledged and addressed.

Value Alignment (Normative)

The adjudicator evaluates whether the models have converged on a shared ethical framework and whether that framework leads them to the same conclusion. It is possible for two models to apply different ethical frameworks and still reach the same conclusion, which also satisfies this trigger. The key signal is that the models are no longer in substantive disagreement about what ought to be done.

Risk Stability (Business)

The adjudicator evaluates whether the models agree on the key risks and their relative likelihood, and whether they are recommending the same course of action or a sufficiently similar one. Debates on complex business questions often converge on the risks quickly but diverge on the recommended response. Risk Stability is only satisfied when both the diagnosis and the prescription are aligned.

Probability Convergence (Prediction)

The adjudicator evaluates whether the models' stated confidence intervals and probability estimates overlap within an acceptable range. A model that says "60 to 70% likely" and a model that says "55 to 65% likely" have overlapping intervals and are converging. A model that says "30% likely" while others say "70% likely" has not converged and deliberation continues.
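
The overlap check described above reduces to a simple interval intersection test. The sketch below illustrates the idea only; it is not the platform's actual implementation, which may tolerate near-misses within some margin:

```python
def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True if two (low, high) probability intervals intersect."""
    return max(a[0], b[0]) <= min(a[1], b[1])
```

Applied to the examples above: "60 to 70%" and "55 to 65%" overlap, while point estimates of 30% and 70% (degenerate intervals) do not.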

Coverage Saturation (Brainstorm)

The adjudicator evaluates whether the debate is still generating distinct new ideas. In early rounds, models surface different angles and approaches. As rounds continue, new responses start to acknowledge and build on existing ideas rather than introducing genuinely new ones. Coverage Saturation is reached when the marginal idea contribution across models drops below a meaningful threshold, indicating the idea space has been sufficiently explored.
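
One way to picture saturation is as a declining count of novel ideas per round. The sketch below is illustrative only; the `threshold` parameter is an invented stand-in for whatever cutoff the adjudicator actually applies, and real idea matching is semantic rather than string equality:

```python
def coverage_saturated(rounds: list, threshold: int = 1) -> bool:
    """True when the latest round introduced fewer than `threshold` ideas
    not already surfaced in earlier rounds."""
    seen = set()
    for ideas in rounds[:-1]:
        seen.update(ideas)
    novel = [i for i in rounds[-1] if i not in seen]
    return len(novel) < threshold
```

A debate whose latest round only restates ideas from earlier rounds would satisfy this check and terminate.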

Framework Consensus (Belief)

The adjudicator evaluates whether the models have identified a set of underlying principles or assumptions they share, even if their surface-level framings differ. Belief debates rarely produce full convergence on the conclusion. Framework Consensus acknowledges this and focuses instead on whether the models have articulated a shared intellectual foundation from which their different conclusions follow. This is the most nuanced termination trigger and the hardest to satisfy.

Debate Roles

Three distinct roles participate in every debate. Each role has a specific function and the separation between them is intentional. A model that debates should not also judge whether consensus has been reached. A model that judges should not also write the synthesis. Keeping these roles separate prevents any single model's perspective from dominating the outcome.

Debaters

Debaters are the primary reasoning models. They receive the question, produce structured answers, defend their positions across rounds, consider counterarguments from other debaters, and revise their views when presented with compelling evidence or reasoning. The quality of the debate depends heavily on the diversity of the debater panel. Models from different providers, trained on different data with different optimization objectives, will surface genuinely different perspectives.

You can configure between 2 and 10 debaters depending on your plan. With 2 debaters you get a direct exchange between two perspectives. With 3 or more you get a richer discussion but more rounds are typically needed before consensus. For most questions, 3 debaters provides a strong balance between diversity and efficiency. 5 or more debaters is most valuable for brainstorm questions and complex strategic decisions where covering a wide surface area matters.

Adding more debaters increases both latency and cost proportionally. A debate with 6 debaters takes roughly twice as long and costs roughly twice as much as one with 3. Start with 3 and scale up only when the additional diversity is needed.

Adjudicator

The adjudicator does not participate in the debate itself. Its role is to evaluate the state of consensus after each round and to score debater accuracy at the end. On the Free plan, consensus evaluation uses an algorithmic approach based on the structured fields in each debater's response. On Pro and Enterprise plans, you can configure any supported model as the adjudicator. A model-based adjudicator can assess nuances that are difficult to capture algorithmically, such as whether two models are genuinely aligned on a normative framework or are simply using compatible language to describe incompatible positions.

Choosing a strong adjudicator matters most for normative, belief, and complex business questions where consensus is qualitative rather than quantitative. For factual and prediction questions, the algorithmic approach on Free is often sufficient because convergence is easier to measure.

Synthesizer

Once the debate reaches consensus or exhausts its round limit, the synthesizer writes the final answer. It receives the full debate transcript, including the blind round responses and all deliberation round responses, presented anonymously. Its task is to distill what the debaters agreed on into a single clear answer that does not privilege any individual model's phrasing or framing.

The synthesizer is always a different model from the debaters. This separation ensures that the final answer is a genuine distillation rather than a reproduction of whatever the most influential debater said. The synthesis is not a summary of the debate process. It is a direct answer to the original question that reflects the consensus position.

If the debate exhausted its round limit without reaching consensus, the synthesizer still receives the full transcript and writes the best available answer. The synthesis in this case will reflect the closest agreement reached and may acknowledge remaining uncertainties where they are genuine.

The Blind Round

The first round of every debate is a blind round. Each debater receives the question and its classification. No model sees what any other model has written. This constraint is not an arbitrary restriction. It is the most important structural choice in the debate design.

When models see each other's answers before forming their own, they anchor to the first answer they encounter. This is true even for large language models. A model that sees a well-articulated position before reasoning from scratch will tend to validate and build on that position rather than independently generating its own. The result is premature convergence: the models appear to agree not because they reached the same conclusion through independent reasoning but because the first answer set the frame for all subsequent ones.

The blind round surfaces the full diversity of perspectives that exist across the panel before any deliberation begins. Only once each model has committed to its own answer does the deliberation phase start, and at that point the models are negotiating between positions they genuinely hold rather than refining a shared first draft.

Each debater's blind round response is a structured object containing its direct answer to the question, a list of key claims with individual confidence scores for each claim, an overall confidence score for the answer as a whole, and any stated assumptions or uncertainties that the model believes are material to its answer. These structured fields are what the adjudicator uses to begin measuring consensus.

```json
{
  "answer": "The primary proven health effects of caffeine include...",
  "claims": [
    { "claim": "Caffeine improves short-term alertness and reaction time", "confidence": 0.97 },
    { "claim": "Regular consumption above 400mg/day is associated with increased anxiety", "confidence": 0.88 },
    { "claim": "Caffeine has a protective association with Parkinson's disease risk", "confidence": 0.82 }
  ],
  "overall_confidence": 0.91,
  "assumptions": [
    "References healthy adults without pre-existing cardiovascular conditions",
    "Based on peer-reviewed research as of the model's training cutoff"
  ]
}
```

The blind round is always round one, regardless of plan or configuration. There is no way to skip it. It is the foundation on which the rest of the debate is built.

Deliberation Rounds

From round two onwards, the debate enters the deliberation phase. Each debater now receives two things: its own answer from the previous round, and anonymized versions of all other debaters' most recent responses. The anonymization is a deliberate design choice. Models see "Model A" and "Model B" as identifiers rather than the actual provider or model name. This prevents social bias from distorting the deliberation.

Without anonymization, a model might defer to what it perceives as a more authoritative or prestigious model rather than engaging with the argument on its merits. Anonymization ensures that every position is evaluated on the strength of its reasoning, not the reputation of the model that produced it.

Each debater's deliberation response is structured to make engagement explicit and auditable. The response includes what the model learned from others' answers that it did not know or had not considered, what it corrected in its own previous answer as a result of the deliberation, what positions or claims from other models it still challenges and why, and what genuine points of disagreement it believes remain unresolved. These fields are what allow the adjudicator to measure whether consensus is actually progressing or whether models are only superficially acknowledging each other without changing their positions.

```json
{
  "answer": "Revised position: The health effects of caffeine are well-established for alertness...",
  "claims": [
    { "claim": "Caffeine improves short-term alertness and reaction time", "confidence": 0.97 },
    { "claim": "The cardiovascular risk threshold is individual-dependent, not a fixed 400mg/day figure", "confidence": 0.79 }
  ],
  "overall_confidence": 0.88,
  "learned_from_others": "Model B's point about individual metabolic variation affecting cardiovascular risk is well-supported and I have updated my claim accordingly.",
  "corrections": "I overstated the certainty of the 400mg/day figure. The research shows this is a population-level average with high individual variance.",
  "still_challenges": "I believe Model A's claim that caffeine has a protective effect on Parkinson's risk is correlational rather than causal. I maintain this distinction matters.",
  "unresolved": "The degree to which tolerance development affects long-term cognitive benefits remains genuinely contested."
}
```

The debate continues until the adjudicator's consensus score crosses the required threshold or the maximum number of rounds is reached. The maximum depends on your plan. Free plans allow up to 2 rounds total (1 blind plus 1 deliberation). Pro plans allow up to 4 rounds. Enterprise plans allow up to 10 rounds. Longer debates are most valuable for complex questions where the models have genuine substantive disagreements to work through.
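
The per-plan limits reduce to a small lookup. Since round one is always the blind round, the number of deliberation rounds is the plan maximum minus one. This restates the figures above; the dictionary keys are an illustrative naming choice:

```python
MAX_ROUNDS = {"free": 2, "pro": 4, "enterprise": 10}

def max_deliberation_rounds(plan: str) -> int:
    # Round one is always the blind round, so deliberation
    # gets whatever remains of the plan's total.
    return MAX_ROUNDS[plan] - 1
```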

Reaching the round limit without consensus does not mean the debate failed. It means the models had genuine disagreements that could not be resolved in the allotted rounds. Synthesis still runs and produces the best available answer. The consensus scores in the history record will show you how close the models got.

Devil's Advocate

Starting in round three, debaters are assigned an additional task alongside their regular deliberation. Each model is asked to identify the single strongest flaw, gap, or counterargument in the current majority position, even if the model itself agrees with that majority view.

This mechanism exists to prevent debates from collapsing into shallow consensus. As rounds progress, models naturally begin to align. This is the intended behavior. But premature alignment, where models converge before they have genuinely tested the emerging consensus, produces worse answers than full disagreement. The devil's advocate task creates structured pressure to find weaknesses even in positions the model supports.

A model performing the devil's advocate task might write: "The majority position assumes that consumer preferences for EVs will remain stable through 2035, but this assumption is vulnerable to a scenario where battery supply chain constraints produce significant price volatility in the early 2030s, which the current consensus has not adequately addressed." The model may still believe the majority prediction is correct. But by surfacing this vulnerability, it forces the other debaters to either address it or explicitly acknowledge it as an accepted risk in the final synthesis.

Devil's advocate contributions appear in the structured response alongside the model's regular deliberation fields. The adjudicator takes them into account when evaluating consensus. A debate where the devil's advocate task is consistently producing strong unaddressed counterarguments will not satisfy the termination trigger, even if the main positions have converged. Consensus must be robust to the strongest available objection, not just internally consistent among the debaters.

Devil's advocate only activates from round three. On Free plans with a maximum of 2 rounds, it never fires. If thoroughness under adversarial pressure is important for your use case, Pro or Enterprise plans with at least 3 rounds are required.

Consensus Scoring

After each round, the adjudicator scores the degree of consensus across four dimensions. These four scores are combined equally to produce an overall consensus score. The overall score is compared against the termination threshold for the relevant question type. When it crosses that threshold, synthesis begins.

The four dimensions each capture a different aspect of agreement. High consensus on one dimension with low consensus on another indicates the shape of the remaining disagreement and guides the next round of deliberation.
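
Since the four dimensions are weighted equally, the overall score is just their unweighted mean, compared against the question type's threshold. A minimal sketch of that arithmetic (the threshold value in the usage below is a placeholder, not a documented number):

```python
def overall_consensus(scores: dict) -> float:
    """Unweighted mean of the per-dimension consensus scores."""
    return sum(scores.values()) / len(scores)

def should_terminate(scores: dict, threshold: float) -> bool:
    return overall_consensus(scores) >= threshold
```

With the per-dimension scores 0.87, 0.79, 0.81, and 0.72, the overall score is 0.7975, which would not cross a threshold of 0.85.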

Stance Alignment

Stance Alignment measures how closely the debaters' final positions agree at the level of their stated conclusion. Two models that both conclude "yes, this policy should be adopted" have high stance alignment. Two models where one says "yes with significant caveats" and another says "no under current conditions" have low stance alignment. This is the most direct measure of whether the models are actually agreeing.

Empirical Overlap

Empirical Overlap measures how much the factual claims and cited evidence from each model intersect with those of the others. Models that are citing different studies, different data points, or different historical examples to support their answers may reach the same conclusion but are doing so on different evidential foundations. High empirical overlap means the models are working from a shared factual basis, which makes the consensus more robust.

Framework Agreement

Framework Agreement measures whether the models share a common analytical or reasoning framework. This dimension matters most for normative, business, and belief questions. Two models can agree on a conclusion while applying entirely different reasoning frameworks, and the synthesis of such a debate needs to acknowledge this. Conversely, two models that apply the same framework are likely to resolve remaining disagreements in subsequent rounds because they are operating from the same analytical foundation.

Confidence Convergence

Confidence Convergence measures whether the models' stated confidence scores are approaching one another across rounds. A panel where one model reports 0.95 confidence and another reports 0.45 confidence is not converging on a reliable answer, even if their stated positions are similar. Genuine consensus requires not just agreement on what is true but approximate agreement on how certain that truth is.

You can observe consensus progress in real time through the consensus SSE events emitted after each round. Each event includes the per-dimension scores, the overall score, whether the termination threshold was crossed, and if not, which dimensions are lagging and why.

```json
{
  "event": "consensus",
  "round": 2,
  "scores": {
    "stance_alignment": 0.87,
    "empirical_overlap": 0.79,
    "framework_agreement": 0.81,
    "confidence_convergence": 0.72
  },
  "overall": 0.80,
  "terminated": false,
  "lagging_dimensions": ["confidence_convergence"],
  "notes": "Models are broadly aligned on conclusion and evidence but confidence intervals remain wide. One model expresses significantly higher uncertainty about the long-term market dynamics."
}
```

Synthesis

When the debate ends, either because consensus was reached or because the round limit was hit, the synthesizer model produces the final answer. Synthesis is the last step before accuracy scoring and it produces the output that most callers use as the debate result.

The synthesizer receives the full debate transcript. This includes the blind round response and every deliberation round response from each debater, which means the synthesizer can observe not just where the models ended up but how they got there. A model that dramatically revised a position mid-debate provides different signal than one that held its initial answer throughout.

All debater responses are presented to the synthesizer anonymously. The synthesizer does not know which provider or model name corresponds to which position. This prevents it from unconsciously weighting one model's phrasing over another based on perceived authority.

The synthesis is a direct answer to the original question. It is not a meta-commentary on the debate, not a summary of what each model said, and not a hedge that lists all the positions without taking one. The synthesizer's job is to write the answer that the debate converged on, stated as clearly and concisely as possible. Where the models identified genuine unresolvable uncertainty, the synthesis will say so, but it will not manufacture uncertainty to be safe where the models actually agreed.

The synthesis is streamed to the client as a synthesis SSE event and is also stored in the debate history record. It is the primary output of the debate and the starting point for most downstream use cases.

For brainstorm questions, synthesis works differently from analytical question types. Rather than distilling a single agreed conclusion, the synthesizer organizes the ideas that received the broadest acknowledgment across the panel into a structured output, ranked by how many debaters independently surfaced or endorsed each idea.

Accuracy Evaluation

After synthesis, the adjudicator evaluates each debater's performance individually. This evaluation produces accuracy scores that reflect how well each model contributed to the debate, not just whether it ended up on the winning side of the consensus.

Three dimensions are scored on a 0 to 100 scale for each debater.

Factual Accuracy

Factual Accuracy measures how well the model's claims aligned with the consensus and with verifiable facts. A model that asserted claims that were later corrected by other debaters, or that the adjudicator identifies as factually contested, will score lower on this dimension. A model whose claims were consistently validated and built upon by the debate scores higher. This dimension is most meaningful for factual and prediction question types.

Logical Consistency

Logical Consistency measures whether the model's arguments were internally coherent and free of contradictions across rounds. A model that reversed a position without acknowledging the reversal, or that held contradictory claims simultaneously, will score lower. This dimension rewards intellectual honesty: a model that explicitly identifies when and why it changed its mind scores higher than one that silently shifts position without acknowledging the change.

Evidence Quality

Evidence Quality measures how well the model supported its claims with reasoning, examples, analogies, or data. A model that asserts conclusions without supporting them scores lower than one that explains its reasoning in depth. This dimension does not require the model to cite external sources. Strong logical argumentation, well-constructed analogies, and clearly stated inference chains all contribute positively to Evidence Quality.

Trajectory Score

The Trajectory Score is distinct from the three accuracy dimensions. It measures how constructively the model engaged with the debate process itself across rounds. A model that started with an incomplete initial answer but meaningfully revised it in response to other debaters' arguments scores higher on trajectory than one that had a strong initial answer but ignored everything the other models said. A high Trajectory Score indicates a model that is genuinely responsive to argument rather than one that simply outputs its initial conclusion and waits for the debate to end.

Trajectory Score matters because the value of a debate system comes precisely from models updating on each other's reasoning. A model with a high Trajectory Score is contributing to the process. A model with a low Trajectory Score, even if its initial answer was correct, is not making the debate better.

All accuracy scores are available in the accuracy SSE event at the end of the debate and in the full debate history record. The scores are presented per-model, anonymized by the same letter identifiers used in the debate transcript.

```json
{
  "event": "accuracy",
  "scores": {
    "model_a": {
      "factual_accuracy": 91,
      "logical_consistency": 88,
      "evidence_quality": 85,
      "trajectory_score": 72
    },
    "model_b": {
      "factual_accuracy": 83,
      "logical_consistency": 94,
      "evidence_quality": 90,
      "trajectory_score": 88
    },
    "model_c": {
      "factual_accuracy": 87,
      "logical_consistency": 81,
      "evidence_quality": 78,
      "trajectory_score": 95
    }
  }
}
```
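
A consumer of the accuracy event might, for example, rank the debaters by their mean score across all four reported dimensions. This is a client-side sketch, not part of the API:

```python
def rank_by_mean_score(scores: dict) -> list:
    """Return model identifiers ordered by mean score, highest first."""
    def mean(dims: dict) -> float:
        return sum(dims.values()) / len(dims)
    return sorted(scores, key=lambda m: mean(scores[m]), reverse=True)
```

Applied to the example event above, model_b ranks first (mean 88.75), followed by model_c (85.25) and model_a (84.0).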

Auto vs Manual Mode

DebateTalk supports two modes for configuring a debate panel. The mode affects how models are selected. It does not affect how the debate itself runs. Classification, blind rounds, deliberation, devil's advocate, consensus scoring, synthesis, and accuracy evaluation all work identically in both modes.

Auto Mode

In Auto mode, DebateTalk classifies your question and then selects the best available models for that question type. The selection logic applies two priorities simultaneously. First, it favors models that have historically performed well on questions classified in the same domain, drawing on accuracy scores accumulated across all debates on the platform. Second, it enforces diversity across AI providers. A debate panel will not be filled with models from the same provider, because models from the same provider tend to have similar training backgrounds and will naturally agree more quickly, reducing the value of the debate.

Auto mode is the recommended starting point for most use cases. It removes the need to research individual model capabilities and regularly incorporates new models as they become available and establish a performance track record.

Manual Mode

In Manual mode, you configure the debaters, adjudicator, and synthesizer yourself via the model configuration parameters in the API request. You can mix any combination of supported models. Manual mode is appropriate when you want to compare specific models head-to-head, when you are building a product that always uses the same fixed panel, or when you are benchmarking a new model against established ones to understand how it performs relative to the field.

Manual mode requires more configuration knowledge but gives you complete control over the composition of the panel. You are responsible for ensuring diversity: if you configure three models from the same provider, the debate may converge faster but the consensus will reflect a narrower range of perspectives.

Switching between Auto and Manual mode is a per-request choice. You can use Auto mode for exploratory questions and Manual mode for structured benchmarks or product integrations, within the same API account.
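
As an illustration of the per-request choice, a Manual mode request might carry an explicit panel configuration that an Auto mode request would omit. The field names and model identifiers below are hypothetical; consult the API reference for the actual parameters.

```json
{
  "question": "Should a Series A startup expand internationally in year one?",
  "mode": "manual",
  "debaters": ["provider-x/model-1", "provider-y/model-2", "provider-z/model-3"],
  "adjudicator": "provider-x/model-4",
  "synthesizer": "provider-y/model-5"
}
```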

Debate Lifecycle

A debate moves through a defined set of states from the moment it starts to when it is stored in history. Understanding the lifecycle helps you build reliable integrations and debug unexpected behavior.

States

running is the active state. The debate is currently processing and streaming SSE events to the client. Events arrive in order: round events as each debater completes its response, consensus events after each round, a synthesis event when the synthesizer finishes, and an accuracy event with the final scores.

completed means the debate finished normally. Either the consensus threshold was crossed and synthesis ran, or the round limit was reached and synthesis ran anyway. The history record is fully populated and all SSE events were emitted. This is the expected terminal state for a successful debate.

failed means an unrecoverable error occurred during the debate. The most common cause is all debaters in a given round returning errors simultaneously, for example when multiple model provider APIs are unavailable. When this happens, the debate transitions to failed immediately rather than attempting to continue with partial responses. A partial transcript may be stored in the history record but synthesis does not run.

aborted means the client disconnected before the debate finished. SSE connections are stateful. If the connection drops mid-debate, the backend records the state at the point of disconnection. A partially completed debate in aborted state will have whatever rounds completed before the disconnect stored in history, but synthesis will not run.

History and Storage

Authenticated debates are stored in history regardless of their final state. A failed debate and a completed debate are both stored. The status field in the history record tells you the terminal state. Unauthenticated requests are not stored. The debate history API returns the full transcript, consensus scores per round, synthesis, accuracy scores, and metadata including the question classification, the number of rounds run, and the terminal state.

The Final SSE Event

The last SSE event in every debate is a final event. It contains the debate ID, the terminal status, a reference to the synthesis, the overall consensus score at termination, and a brief summary of the debate run including how many rounds were completed and whether consensus was reached or the round limit was hit. The final event is your signal that the stream is complete and the history record is fully written.

```json
{
  "event": "final",
  "debate_id": "dbt_01j9k2m3n4p5q6r7s8t9",
  "status": "completed",
  "rounds_completed": 3,
  "consensus_reached": true,
  "final_consensus_score": 0.86,
  "synthesis_id": "syn_01j9k2m3n4p5q6r7s8u0"
}
```

If you do not receive a final event, the connection was interrupted before the debate completed. Check the debate history API using the debate ID from the initial response. The status field will reflect whether the debate completed on the backend after your client disconnected.
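
A defensive client can treat the absence of a final event as its cue to consult history. In the sketch below, `fetch_history` is a hypothetical stand-in for whatever history lookup your integration performs:

```python
def resolve_status(events: list, fetch_history) -> str:
    """Return the debate's terminal status, falling back to the history
    lookup when the stream ended without a `final` event."""
    for e in reversed(events):
        if e.get("event") == "final":
            return e["status"]
    # No final event: the connection was interrupted; ask the backend.
    return fetch_history()["status"]
```

With a complete stream, the status comes from the final event itself; with a truncated stream, it comes from the history record.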