DecisionBench: Measuring the Agent Handoff, Not Just the Answer

20 hours ago

Updated: 13 hours ago

A benchmark for emergent delegation in long-horizon agentic workflows.

We introduce DecisionBench, a benchmark for emergent delegation in long-horizon agentic workflows. It measures not just whether a task gets solved, but whether an agent hands its subtasks to the right peer model along the way. We characterize the benchmark with a five-condition reference sweep across an 11-model, 7-vendor pool, covering 23,375 task instances, and release the substrate, the annotation layer, the analysis pipeline, and 220 per-condition run archives.

TL;DR

The whole industry is betting on multi-agent systems. Nobody could measure the one thing that makes them work: whether agents delegate to the right model. We built the instrument that does, and it already overturned the conventional wisdom.

The decision layer was invisible, and we made it measurable. Every benchmark stops at "did the task get done." DecisionBench is the first to score the handoff itself, the moment one agent routes a subtask to another.
Quality-only evaluation is flying blind. End-task quality is dead flat across every delegation setup we tried. Score only the outcome and every design looks identical, while the real signal hides in the process, untouched.
The headline finding flips a core assumption: delivery beats content. The same peer information, handed over through a tool instead of pasted into a prompt, doubles routing accuracy (14.2% to 29.5%) at equal quality and lower cost. How you surface knowledge matters more than what it says.
There is enormous performance left on the table. A perfect-delegation ceiling sits 15 to 31 points above where today's systems land, on every benchmark we ran. That gap is the prize.
This is the missing layer of the agent economy, and we are releasing all of it: the substrate, the annotation layer, the analysis pipeline, and 220 run archives. Bring your own router and compete on the same instrument.

Why we built it

Multi-agent systems are now routine. An orchestrating agent works a task and, when it judges a subtask better handled elsewhere, delegates it to a peer model. These handoffs are everywhere in production, and almost none of them are measured.

Three decisions sit inside every delegation: whether to delegate at all, which peer to delegate to, and what to tell the orchestrator about its options. Today those decisions are made by intuition.

Existing agentic benchmarks do not help, because they were built to answer a different question. GAIA measures general tool-use question answering. tau-bench measures multi-turn state tracking under domain policy. BFCL measures function-calling correctness. Each scores the final answer on a fixed task, against a single agent. None of them watches the handoff itself. This matters more than it first appears: a system can route every subtask to a worse-suited peer and still post a respectable end-task score, because on many tasks the orchestrator could have solved the subtask adequately on its own. The poor routing decision is invisible in the outcome. The signal that would reveal it is never recorded.

Work on cost-aware routing comes closest, but it treats routing as a learned external policy bolted onto the system. Multi-agent frameworks typically hand-code who does what. Neither measures whether an agent, given the freedom to delegate, exhibits good delegation behavior on its own. That is the gap DecisionBench is built to close.

What DecisionBench is

DecisionBench is a fixed substrate rather than a single leaderboard. It pins down the parts of the evaluation that need to stay constant, and stays deliberately agnostic about the method being tested, so that learned routers, multi-step delegation, adaptive peer profiles, and richer peer memories can all be compared head to head on the same instrument.

The substrate fixes five things.

A task suite. Long-horizon agentic tasks drawn from GAIA, tau-bench, and BFCL multi-turn, partitioned by a deterministic stratified split into a Stage-1 profiling set and a held-out Stage-2 evaluation set. The three suites span open-ended retrieval, policy-governed dialogue, and structured tool calls, so delegation is tested across genuinely different kinds of work.

A peer-model pool. Eleven models across seven vendor families, including Claude Opus 4.7, GPT-5.5, Gemini-3.1-Pro, and DeepSeek V4, pinned to a fixed freeze date and routed through a single provider so every model is priced under identical market conditions. The pool is intentionally heterogeneous in vendor, size class, and reasoning capability, so that delegation is plausibly useful rather than decorative.

A delegation interface. An orchestrator is an agent running the task loop with a call_model(name, subtask, budget) tool that hands a subtask to a named peer. The peer sees only the subtask string and its own system prompt, returns a result, and the orchestrator continues. A second optional tool, read_profile(model), returns a structured description of a peer. This is the channel through which peer-information interventions are delivered, and the benchmark does not prescribe what it returns; that is part of the method being evaluated.

An annotation layer. A frozen seven-skill taxonomy and a deterministic, rule-based step tagger that labels each step of a trajectory using only trace signal, no model judgment. This is released as analysis machinery, not as a method we propose. It produces the per-skill pass-rate statistics that define the best-suited peer for any given subtask, which in turn grounds the routing-fidelity and ceiling metrics. An audit against a free-form re-tagging finds that 94.5% of steps map cleanly onto the seven labels.

A metric suite. Per-suite quality, cost, and latency; delegation rate; routing fidelity, the share of handoffs that go to a top-ranked peer for the relevant skill; vendor self-preference against a chance baseline; and a counterfactual-delegation ceiling that bounds the headroom available to a perfect router. Every metric is computed post-hoc from released traces.
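To make the routing-fidelity definition concrete, here is a minimal Python sketch. It assumes Stage-1 results arrive as (peer, skill, passed) rows and that "top-ranked" means tied for the highest pass rate on a skill; the function names, peer names, and skill labels are hypothetical and are not taken from the released pipeline.

```python
from collections import defaultdict

def pass_rates(stage1_rows):
    """stage1_rows: iterable of (peer, skill, passed) from Stage-1 profiling."""
    counts = defaultdict(lambda: [0, 0])  # (peer, skill) -> [passes, attempts]
    for peer, skill, passed in stage1_rows:
        counts[(peer, skill)][0] += int(passed)
        counts[(peer, skill)][1] += 1
    return {key: p / n for key, (p, n) in counts.items()}

def top_ranked(rates, skill):
    """Peers tied for the highest pass rate on `skill` (assumed reading of 'top-ranked')."""
    by_peer = {peer: r for (peer, s), r in rates.items() if s == skill}
    best = max(by_peer.values())
    return {peer for peer, r in by_peer.items() if r == best}

def routing_fidelity(handoffs, rates):
    """Share of handoffs (skill, chosen_peer) that went to a top-ranked peer."""
    if not handoffs:
        raise ValueError("no handoffs to score")
    hits = sum(peer in top_ranked(rates, skill) for skill, peer in handoffs)
    return hits / len(handoffs)

# Toy data with hypothetical peers and skills.
stage1 = [
    ("peer_a", "retrieval", True), ("peer_a", "retrieval", True),
    ("peer_b", "retrieval", True), ("peer_b", "retrieval", False),
    ("peer_a", "tool_call", False), ("peer_a", "tool_call", False),
    ("peer_b", "tool_call", True), ("peer_b", "tool_call", True),
]
rates = pass_rates(stage1)
handoffs = [("retrieval", "peer_a"), ("tool_call", "peer_a"),
            ("tool_call", "peer_b"), ("retrieval", "peer_b")]
fidelity = routing_fidelity(handoffs, rates)
print(fidelity)  # 0.5: two of the four handoffs reached the top-ranked peer
```

Because the score depends only on logged handoffs and Stage-1 statistics, it can be recomputed after the fact from traces, which matches the post-hoc framing above.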

The two-stage protocol. Stage 1 profiles peers and builds skill annotations; Stage 2 runs the orchestration conditions and scores them on the full metric suite.
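The deterministic stratified split that separates the Stage-1 and Stage-2 sets could be sketched as follows. This is an illustration under stated assumptions, not the released splitter: the stratum is taken to be the source suite, and the 30% Stage-1 fraction and the salt string are invented for the example.

```python
import hashlib
from collections import defaultdict

def stratified_split(tasks, stage1_fraction=0.3, salt="decisionbench-split"):
    """tasks: list of (task_id, stratum). Returns (stage1_ids, stage2_ids)."""
    by_stratum = defaultdict(list)
    for task_id, stratum in tasks:
        by_stratum[stratum].append(task_id)

    stage1, stage2 = [], []
    for stratum in sorted(by_stratum):
        # Order by a salted hash so the result depends on neither input order nor RNG state.
        ordered = sorted(
            by_stratum[stratum],
            key=lambda tid: hashlib.sha256(f"{salt}:{tid}".encode()).hexdigest(),
        )
        cut = round(len(ordered) * stage1_fraction)
        stage1.extend(ordered[:cut])   # profiling set
        stage2.extend(ordered[cut:])   # held-out evaluation set
    return stage1, stage2

# Hypothetical task ids: 10 from one suite, 20 from another.
tasks = [(f"gaia-{i}", "gaia") for i in range(10)] + \
        [(f"bfcl-{i}", "bfcl") for i in range(20)]
s1, s2 = stratified_split(tasks)
print(len(s1), len(s2))  # 9 21
```

Hashing task ids rather than shuffling with a seeded generator means the same task always lands in the same stage, even if the task list is reordered.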

To characterize the substrate, we instantiate five reference conditions that vary how peer information reaches the orchestrator: a blind baseline with no peer description; three conditions that construct a peer description three different ways (a curated rubric, deterministic Stage-1 statistics, and dual out-of-pool LLM-judge summaries) and preload it into the prompt; and an ablation that supplies the statistics-based description through the read_profile tool only, with nothing preloaded. These conditions are baselines we use to demonstrate the benchmark, not part of its definition.
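As a rough picture of the interface these conditions vary, the sketch below wires up call_model and read_profile on the harness side. Only the two tool names and their parameters come from the description above; the class, the stub peer, the profile contents, and the treatment of budget as a character cap are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class DelegationTools:
    """Hypothetical harness-side implementation of the two delegation tools."""
    peers: Dict[str, Callable[[str, int], str]]    # name -> backend(subtask, budget)
    profiles: Optional[Dict[str, dict]] = None     # None: read_profile is not offered
    log: List[dict] = field(default_factory=list)  # trace of every tool call

    def call_model(self, name: str, subtask: str, budget: int) -> str:
        # The peer receives only the subtask string; its system prompt lives in the backend.
        result = self.peers[name](subtask, budget)
        self.log.append({"tool": "call_model", "peer": name, "budget": budget})
        return result

    def read_profile(self, model: str) -> dict:
        if self.profiles is None:
            raise RuntimeError("read_profile is not available in this condition")
        self.log.append({"tool": "read_profile", "peer": model})
        return self.profiles[model]

def echo_peer(subtask: str, budget: int) -> str:
    """Stand-in for a real model call; budget is a character cap purely for illustration."""
    return f"done: {subtask[:budget]}"

profiles = {"peer_a": {"retrieval": 0.8, "tool_call": 0.4}}  # made-up statistics
tool_only = DelegationTools(peers={"peer_a": echo_peer}, profiles=profiles)
blind = DelegationTools(peers={"peer_a": echo_peer})  # no peer description at all

card = tool_only.read_profile("peer_a")
answer = tool_only.call_model("peer_a", "find the 2019 filing date", budget=64)
print(answer)  # done: find the 2019 filing date
```

Logging every tool call in one trace is what lets process metrics be computed later without rerunning the agent.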

What we found

We organize the results around the questions DecisionBench is designed to answer about any peer-awareness intervention.

Quality alone tells you almost nothing

Mean end-task quality is statistically indistinguishable across every awareness condition we tested. In a mixed-effects regression over all 23,375 task rows, all four awareness coefficients land within 0.010 of the blind baseline, all with p greater than or equal to 0.21. The largest pairwise contrast between any two conditions is not significant.

An evaluation that scored only the outcome would therefore conclude that how you set up delegation does not matter at all. That conclusion would be wrong, and the reason it is wrong is the central methodological point of the paper: the orchestration signal is in the process, not the outcome. A benchmark that records only how the task ended cannot see it.

End-task quality is flat across conditions on all three suites. The tool-only ablation matches or beats the blind baseline at lower cost and lower delegation count.

Delivery channel dominates description content

When we measure the process directly, asking for each handoff whether the orchestrator picked the peer best suited to the inferred subtask skill, the flat picture breaks open.

Surfacing peer information through an on-demand tool more than doubles routing fidelity over the blind baseline, from 14.2% to 29.5%, at equal quality and lower mean cost. Pasting the exact same information into the system prompt recovers less than half of that gain: the three preloaded variants land at 7.5%, 20.8%, and 15.5%, with the curated-rubric variant actually falling below blind.

This is the result we did not expect. The same content, delivered two different ways, produces sharply different routing fidelity: the tool channel roughly doubles the blind baseline, while the best preloaded variant recovers less than half of that gain. Two mechanisms are consistent with the data: an agent that reads a profile card on demand exercises a moment of discrimination that a preloaded list short-circuits, and preloaded lists appear to surface peer names as pattern-matching cues that substitute for actually reading the cards. The practical recommendation is direct: surface peer-skill information through tools, not preloaded prompt sections.

Adding on-demand tool access doubles delegation fidelity; re-rendering the same information as a preloaded description recovers less than half the gain.
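The "doubles" and "less than half the gain" claims can be checked directly from the reported percentages. The short script below uses only the numbers quoted in this section; the variable names are ours.

```python
blind = 14.2        # routing fidelity, blind baseline (%)
tool_only = 29.5    # statistics served through read_profile only (%)
preloaded = [7.5, 20.8, 15.5]  # three preloaded variants (%); 7.5 is the curated rubric

tool_gain = tool_only - blind          # gain of the tool-only condition, in points
ratio = tool_only / blind              # multiplicative improvement over blind
shares = [(p - blind) / tool_gain for p in preloaded]

print(f"tool-only gain: {tool_gain:.1f} points ({ratio:.2f}x blind)")
for p, s in zip(preloaded, shares):
    print(f"preloaded at {p}%: recovers {s:.0%} of the tool-only gain")
```

The tool-only gain is 15.3 points, a ratio of about 2.08. The best preloaded variant, at 20.8%, recovers about 43% of that gain, and the curated rubric comes out negative because it sits below the blind baseline.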

The same asymmetry shows up on the quality side, faintly. Decomposing the awareness effect into a tool-availability component and a system-prompt component, the only quality-side signal that ever turns positive is on the tool-availability axis. Preloading the description never moves quality positively on any cell. The large fidelity gap and the faint quality echo point the same way.

The aggregate hides where orchestration actually helps

Although the aggregate frontiers are nearly superimposed, per-agent movement is not. Counting agents whose best aware variant strictly beats the blind cell on both quality and cost, the tool-only condition wins on the majority of BFCL agents and a meaningful share of GAIA agents. A concave fit of capability against benefit identifies a mid-capability regime where awareness helps most: frontier agents are already strong enough that peer information buys little, the weakest agents lack the discrimination to use it, and mid-tier agents capture the benefit.
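One simple way to look for a mid-capability regime is to fit a quadratic of benefit against capability and check that it opens downward. The sketch below assumes a quadratic form, which the text above does not specify, and runs on synthetic points generated from a known parabola; none of the numbers are benchmark results.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                    # sums of x^k
    t = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]  # sums of x^k * y
    # Augmented matrix for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

# Synthetic illustration: capability in [0, 1], benefit drawn from y = -x^2 + x.
capability = [0.1, 0.3, 0.5, 0.7, 0.9]
benefit = [-x * x + x for x in capability]

a, b, c = fit_quadratic(capability, benefit)
is_concave = a < 0
peak = -b / (2 * a)  # capability level where fitted benefit is largest
print(is_concave, round(peak, 3))
```

A negative leading coefficient with a peak inside the observed capability range is the pattern described above: little benefit at either extreme and the most in the middle.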

Cross-suite generalization is weak. An agent that benefits from awareness on one suite does not reliably benefit on another, and on tau-bench the effect is shallow because agents adhere to domain policy and rarely delegate at all. Orchestration ability is at least partly suite-specific.

Aggregate frontier flatness coexists with per-agent Pareto-favorable movement, concentrated in the mid-capability regime.

There is large unrealized headroom

To bound the room for improvement, we compute a counterfactual ceiling: for each task, what if the orchestrator had delegated to the best-suited peer for the inferred skill, with that peer answering at its empirical pass rate? The ceiling sits 15 to 31 percentage points above measured performance on every suite. The gap is widest on weak and mid-tier agents, and near zero on already-saturated cells. Even under a severe context-loss penalty, the qualitative conclusion survives: the orchestration channel has substantial room, precisely where current awareness interventions fail to capture it.

The perfect-delegation ceiling exceeds measured blind quality by 15 to 31 points on every suite, with the largest gaps on weak and mid-tier agents.
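A minimal sketch of the ceiling computation: for each task, take the highest empirical pass rate any peer achieves on the inferred skill, optionally discount it for context loss, and compare the average against measured quality. The multiplicative form of the penalty and all names and numbers here are assumptions for illustration.

```python
def delegation_ceiling(tasks, rates, context_loss=0.0):
    """tasks: list of (inferred_skill, measured_quality).
    rates: {(peer, skill): empirical pass rate}.
    Returns (ceiling, headroom), both as fractions in [0, 1] scale."""
    ceilings = []
    for skill, _ in tasks:
        best = max(r for (peer, s), r in rates.items() if s == skill)
        # Assumed penalty model: the best peer loses a fixed fraction of its pass rate.
        ceilings.append(best * (1.0 - context_loss))
    ceiling = sum(ceilings) / len(ceilings)
    measured = sum(q for _, q in tasks) / len(tasks)
    return ceiling, ceiling - measured

# Toy inputs with hypothetical peers and skills.
rates = {
    ("peer_a", "retrieval"): 0.9, ("peer_b", "retrieval"): 0.6,
    ("peer_a", "tool_call"): 0.5, ("peer_b", "tool_call"): 0.8,
}
tasks = [("retrieval", 1.0), ("retrieval", 0.0), ("tool_call", 1.0), ("tool_call", 0.0)]

ceiling, headroom = delegation_ceiling(tasks, rates)
penalized, penalized_headroom = delegation_ceiling(tasks, rates, context_loss=0.2)
print(round(ceiling, 2), round(headroom, 2))                # 0.85 0.35
print(round(penalized, 2), round(penalized_headroom, 2))    # 0.68 0.18
```

In this toy case the headroom shrinks under the penalty but stays positive, which is the shape of the robustness check described above.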

Agents carry a delegation bias

Several orchestrators over-delegate to peers from their own vendor at well above the chance rate implied by pool composition, the delegation-tool analogue of the self-preference bias documented for LLM-as-judge setups. The effect is uneven, and reading it correctly requires care. One model in the pool appears to anti-prefer its own vendor, but inspecting its raw delegations shows the apparent effect is a capability-tier artifact: the model routes the overwhelming majority of its handoffs to frontier-tier peers and avoids every model with a smaller-sibling naming pattern, including its own, even on skills where that sibling's measured pass rate is higher. What looks like vendor avoidance is tier-recognition behavior reading model names rather than the statistics on the card. The released per-task data is what makes this kind of disambiguation possible.

Same-vendor delegation share against chance. Reading the ratios correctly requires separating vendor identity from capability tier.
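The same-vendor comparison can be sketched as an observed share divided by a chance share. The example assumes chance means a uniform pick over the other peers in the pool and that an orchestrator cannot delegate to itself; the model names and vendors are invented.

```python
def self_preference_ratio(orchestrator, delegations, vendor_of):
    """Returns (observed same-vendor share, chance share, observed / chance)."""
    own = vendor_of[orchestrator]
    others = [m for m in vendor_of if m != orchestrator]
    # Chance rate implied by pool composition, assuming a uniform pick over other peers.
    chance = sum(vendor_of[m] == own for m in others) / len(others)
    if chance == 0:
        raise ValueError("orchestrator has no same-vendor peer in the pool")
    observed = sum(vendor_of[p] == own for p in delegations) / len(delegations)
    return observed, chance, observed / chance

# Hypothetical five-model pool across three vendors.
vendor_of = {
    "v1-large": "v1", "v1-small": "v1",
    "v2-large": "v2",
    "v3-large": "v3", "v3-small": "v3",
}
delegations = ["v1-small", "v1-small", "v2-large", "v3-large"]

observed, chance, ratio = self_preference_ratio("v1-large", delegations, vendor_of)
print(observed, chance, ratio)  # 0.5 0.25 2.0
```

A ratio above 1 only says the orchestrator picks its own vendor more often than pool composition predicts. As noted above, separating that from capability-tier effects still requires looking at which specific peers were chosen.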

What this is good for

Because DecisionBench fixes only the tasks, pool, interface, annotation layer, and metrics, a new orchestration method plugs into the same harness and produces directly comparable numbers, not just on final accuracy but on routing fidelity, cost, vendor bias, and headroom.

A learned router enters as a different call_model policy. Multi-step delegation, where the orchestrator chooses a peer per substep rather than once per task, evaluates against the unchanged metrics. So do adaptive profile construction that resummarizes peers during a run, richer peer memories accumulated across tasks, and heterogeneous pools that add smaller or specialist models. In every case the annotation layer and the process metrics carry over unchanged, so comparisons stay anchored even as the space of methods grows.

Limitations

We are releasing this as an initial characterization, and several constraints are worth stating plainly.

Each Stage-2 cell is run once, so seed-to-seed variability is captured only through paired-bootstrap confidence intervals over tasks, not over repeated runs. Delegation fires rarely on tau-bench, where agents adhere to domain policy regardless of peer information, so those numbers reflect prompt priming more than emergent delegation and should not be used to compare methods in isolation. The reference conditions are baselines we use to demonstrate the substrate; they are not the benchmark, and they are not tuned to win. The curated-rubric variant reflects a single curator's synthesis and cannot be cross-checked the way the deterministic and dual-judge variants can. The model pool is pinned to a single freeze date, so absolute numbers will shift under different pricing and availability, even though the cross-condition design is robust to this. And because DecisionBench reuses tasks from existing suites, any contamination in those suites flows through to our measurements.
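For readers unfamiliar with the paired bootstrap mentioned above, here is a minimal percentile-interval sketch. It resamples per-task score differences between two conditions; the resample count, the seed, and the toy scores are assumptions, and the released analysis pipeline may differ in detail.

```python
import random

def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(a - b), where a[i] and b[i] are two conditions' scores on task i."""
    if len(a) != len(b):
        raise ValueError("conditions must be scored on the same tasks")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(diffs)
    # Resample tasks with replacement, keeping each task's pair of scores together.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi

# Toy per-task pass/fail scores for two hypothetical conditions on the same 8 tasks.
cond_a = [1, 1, 0, 1, 0, 1, 1, 0]
cond_b = [1, 0, 0, 1, 0, 0, 1, 0]
mean_diff, lo, hi = paired_bootstrap_ci(cond_a, cond_b)
print(mean_diff, lo, hi)
```

This captures variability from which tasks happened to be sampled. It does not capture run-to-run variability of the models themselves, which is the limitation stated above.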

What we are releasing

We release the substrate, the deterministic annotation layer, the reference intervention suite, the analysis pipeline, and 220 per-condition run archives spanning roughly 23,000 task-level traces. The canonical artifact is a single per-task record file that backs every figure and most tables, so any headline number can be reproduced from one analysis script.

Our aim is to shift how orchestration methods are evaluated, away from final-accuracy-only scoring and toward the process-level metrics that actually distinguish a good delegation policy from a lucky one.

Paper: arxiv.org/abs/2605.19099
Code and data: huggingface.co/decisionbench