Our experiments demonstrate the potential of LLM-based solutions for agent delegation, highlighting practical techniques such as reranking and leveraging model probabilities to enhance accuracy and interpretability.
Denys Lazarenko
April 24th, 2025
As we have shown in our blog post on AI-native organizations, real organizational efficiency gains come not just from using AI agents, but from ensuring the right actors — human or AI agent — are assigned to the right tasks at all times.
With specialization advantages between agents and humans reaching 100x, misallocating tasks imposes massive opportunity costs. Conversely, organizations that master optimal agent allocation unlock exponential productivity gains.
At scale, with thousands of tasks and subtasks, manual delegation breaks down. Automation becomes essential. If an agent can reliably complete a task with high accuracy — and user permissions allow — the agent should be auto-selected and assigned.
Automatic, correct delegation becomes the cornerstone of reliable automation.
A few months ago, we introduced self-service agent creation via natural language instructions on our platform.
As more users were onboarded, the agent count grew rapidly, and the fundamental delegation problem (which agent is most suitable for a given task?) became increasingly hard to solve automatically.
In a perfect world, this problem could have been modeled and solved as supervised learning: classifying each task to the most suitable agent.
In reality, conditions were far more complex due to:
For every task (T), the right agent (A) had to be selected across all domains (D). The size of the problem can be expressed as T × A × D, a combinatorial search space that quickly grows beyond what manual delegation can handle.
Meanwhile, business needs demanded a solution that was flexible, reliable, and real-time across diverse domains.
Before testing solutions, we first focused on defining how to evaluate them.
Our evaluation aimed to answer:
Following Test-Driven Development principles, we created evaluation datasets.
Initially, we used synthetic datasets across five core domains:
Later, we validated the approach with real customer datasets.
Each agent was characterized by:
To start, we focused on name, description, and sample user requests, creating a synthetic dataset consisting of 70 agents, each assigned five specific tasks. For example:
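(The sketch below is purely illustrative: the agent name, description, sample requests, and tasks are hypothetical, not taken from the actual dataset.)

```python
# Hypothetical entry from the synthetic dataset: one agent with its
# description, sample user requests, and five assigned tasks.
agent_example = {
    "name": "Invoice Triage Agent",
    "description": "Extracts key fields from incoming invoices and routes them "
                   "to the correct approval workflow.",
    "sample_user_requests": [
        "Please process the attached invoice from Acme GmbH.",
        "Which cost center should this invoice be booked to?",
    ],
    "tasks": [
        "Extract invoice number, amount, and due date",
        "Match the invoice to an open purchase order",
        "Route the invoice to the responsible approver",
        "Flag invoices above the approval threshold",
        "Archive the processed invoice",
    ],
}
```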
Ground truth datasets were subsequently generated for evaluation purposes:
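Each ground-truth record pairs a user request with the agent expected to handle it. A hypothetical record (matching the illustrative agent above) could look like this:

```python
# Hypothetical ground-truth record used for evaluation:
# a user request plus the agent that should be selected (the label).
ground_truth_example = {
    "request": "Please process the attached invoice from Acme GmbH.",
    "expected_agent": "Invoice Triage Agent",
}
```

Accuracy is then measured as how often the expected agent appears as the Top-1 prediction, or within the Top-3.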
Now that we had created the evaluation sets, we started testing different solution approaches.
As a baseline, we designed a simple LLM-driven classifier built around a structured prompt that lists the candidate agents.
Using structured outputs, we could then determine the top 3 agents.
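A minimal sketch of this baseline, assuming the OpenAI Python SDK with Pydantic-based structured outputs (the prompt wording, model choice, and field names are illustrative, not our exact production setup):

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Delegation(BaseModel):
    # Top 3 agent names, ordered from most to least suitable.
    top_agents: list[str]
    reasoning: str

def build_prompt(task: str, agents: list[dict]) -> str:
    # List every candidate agent, then state the task to delegate.
    agent_lines = "\n".join(f"- {a['name']}: {a['description']}" for a in agents)
    return (
        "You are a delegation assistant. Given the task below, select the three "
        "most suitable agents from the list, ordered by suitability.\n\n"
        f"Available agents:\n{agent_lines}\n\n"
        f"Task: {task}"
    )

def classify(task: str, agents: list[dict]) -> Delegation:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",          # illustrative model choice
        messages=[{"role": "user", "content": build_prompt(task, agents)}],
        response_format=Delegation,   # parsed directly into the Pydantic model
    )
    return completion.choices[0].message.parsed
```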
This approach initially produced good results:
However, when agents were shuffled, performance declined significantly:
This drop in performance is due to the well-known "lost in the middle" effect, where the model pays less attention to information buried in the middle of a long prompt.
Learning #1: The order of delegates significantly affects classification accuracy. Using a reranker to present the LLM with only the most relevant agents restores accuracy and improves efficiency by reducing token usage:
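A sketch of this shortlisting step, here using embedding similarity to pre-filter candidates before the LLM call (a dedicated cross-encoder reranker could be swapped in; the embedding model and cut-off k are illustrative):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",   # illustrative embedding model
        input=texts,
    )
    return np.array([item.embedding for item in response.data])

def shortlist_agents(task: str, agents: list[dict], k: int = 10) -> list[dict]:
    # Rank agents by cosine similarity between the task and each agent's
    # description plus sample requests, and keep only the top k candidates.
    agent_texts = [
        a["description"] + " " + " ".join(a.get("sample_user_requests", []))
        for a in agents
    ]
    agent_vecs = embed(agent_texts)
    task_vec = embed([task])[0]
    sims = agent_vecs @ task_vec / (
        np.linalg.norm(agent_vecs, axis=1) * np.linalg.norm(task_vec)
    )
    top_idx = np.argsort(sims)[::-1][:k]
    return [agents[i] for i in top_idx]
```

Only the shortlisted agents are then passed to the LLM classifier, which keeps the prompt short and keeps the most relevant candidates out of the problematic middle of the context.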
To improve the baseline approach, the first step was to include user request examples directly in the context of the delegator.
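Concretely, this can be as simple as appending a few sample requests to each agent's entry in the prompt (a sketch, extending the illustrative build_prompt above):

```python
def build_prompt_with_examples(task: str, agents: list[dict]) -> str:
    # Same structure as before, but each agent entry now carries a few
    # sample user requests as additional context for the delegator.
    blocks = []
    for a in agents:
        examples = "; ".join(a.get("sample_user_requests", [])[:3])
        blocks.append(
            f"- {a['name']}: {a['description']} (example requests: {examples})"
        )
    return (
        "You are a delegation assistant. Given the task below, select the three "
        "most suitable agents from the list, ordered by suitability.\n\n"
        "Available agents:\n" + "\n".join(blocks) + f"\n\nTask: {task}"
    )
```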
We then analyzed how the quality of Top-1 matches depends on the number of available agents.
This highlighted an important insight:
Scaling the number of agents without smart filtering significantly degrades matching quality.
Despite the open-ended nature of the classification problem, we achieved strong baseline results. For comparison: with roughly 70 candidate agents, random chance classification would yield only ~1% accuracy.
Learning #2: Including user request examples in the delegation context significantly boosts performance.
To further improve results, we identified several potential next steps:
During experimentation, we explored methods to measure prediction confidence.
One effective approach: using LogProbs from the model outputs.
To enable this functionality in the OpenAI SDK, one can request log probabilities directly when creating a chat completion.
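A sketch of what this can look like (the model choice and prompt are illustrative; the key parameters are logprobs and top_logprobs):

```python
import math
from openai import OpenAI

client = OpenAI()

prompt = (
    "Select the single most suitable agent for the task below. "
    "Answer with the agent name only.\n\n"
    "Agents: Invoice Triage Agent, Contract Review Agent, IT Helpdesk Agent\n"
    "Task: Please process the attached invoice from Acme GmbH."
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    logprobs=True,         # return log probabilities for each output token
    top_logprobs=5,        # also return the 5 most likely alternative tokens
)

# Inspect the alternatives considered for the first generated token and
# convert their log probabilities into percentages.
first_token = completion.choices[0].logprobs.content[0]
for candidate in first_token.top_logprobs:
    print(f"{candidate.token!r}: {math.exp(candidate.logprob):.2%}")
```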
LogProbs offer a probability distribution over choices, helping determine when a model is certain versus when it's uncertain.
Examples from our experiments:
The insights gained from LogProbs helped refine when manual review might be necessary — particularly when probability scores between agents were close.
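One simple way to act on this signal is to compare the probabilities of the top two candidate agents and defer to a human when the margin is small (the 0.2 threshold below is illustrative):

```python
def needs_manual_review(agent_probs: dict[str, float], margin: float = 0.2) -> bool:
    # agent_probs maps candidate agent names to their probabilities
    # (log probabilities exponentiated, as above). If the best and
    # second-best candidates are too close, route the task to a human.
    top_two = sorted(agent_probs.values(), reverse=True)[:2]
    if len(top_two) < 2:
        return False
    return (top_two[0] - top_two[1]) < margin

print(needs_manual_review({"Invoice Triage Agent": 0.91, "Contract Review Agent": 0.05}))  # False
print(needs_manual_review({"Invoice Triage Agent": 0.48, "Contract Review Agent": 0.41}))  # True
```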
Learning #3: Using LogProbs provides valuable certainty signals and improves evaluation of model confidence.
Learning #4: Implementing a rubric-based confidence scoring system further enhanced clarity and interpretability, offering a more structured way to assess prediction quality:
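As an illustration, such a rubric can map the top candidate's probability and its lead over the runner-up to a small set of confidence bands (the thresholds and labels below are hypothetical, not the exact rubric we use):

```python
def confidence_band(top_prob: float, margin: float) -> str:
    # Hypothetical rubric: combine the top candidate's probability with its
    # lead over the runner-up to produce a human-readable confidence label.
    if top_prob >= 0.8 and margin >= 0.3:
        return "HIGH - delegate automatically"
    if top_prob >= 0.5 and margin >= 0.1:
        return "MEDIUM - delegate, but surface alternatives to the user"
    return "LOW - ask the user to pick the agent"

print(confidence_band(0.92, 0.85))  # HIGH - delegate automatically
print(confidence_band(0.55, 0.12))  # MEDIUM - delegate, but surface alternatives to the user
print(confidence_band(0.40, 0.05))  # LOW - ask the user to pick the agent
```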
Our experiments confirmed the strong potential of LLM-based solutions for agent delegation.
Key practical techniques included:
- Reranking, so the delegator only sees the most relevant agents
- Including sample user requests in the delegation context
- Using LogProbs as a certainty signal
- Applying a rubric-based confidence score for interpretability
While this solution already significantly improves the user experience on the Interloom platform, its impact reaches beyond that.
In multi-agent systems, an orchestration agent must reliably allocate and coordinate tasks across multiple agents.
Reliability at this stage is critical: it directly determines the overall quality and efficiency of the system and is key to operationalizing the advantages of AI agents.
We have successfully deployed a similar solution in production, proving it to be: