Our experiments demonstrate the potential of LLM-based solutions for agent delegation, highlighting practical techniques such as reranking and leveraging model probabilities to enhance accuracy and interpretability.
Denys Lazarenko
April 24th, 2025
As we have shown in our blog post on AI-native organizations, real organizational efficiency gains come not just from using AI agents, but from ensuring the right actors — human or AI agent — are assigned to the right tasks at all times.
With specialization advantages between agents and humans reaching 100x, misallocating tasks imposes massive opportunity costs. Conversely, organizations that master optimal agent allocation unlock exponential productivity gains.
At scale, with thousands of tasks and subtasks, manual delegation breaks down. Automation becomes essential. If an agent can reliably complete a task with high accuracy — and user permissions allow — the agent should be auto-selected and assigned.
Automatic, correct delegation becomes the cornerstone of reliable automation.
A few months ago, we introduced self-service agent creation via natural language instructions on our platform.
As more users were onboarded, the agent count grew rapidly, and the fundamental delegation problem (which agent is most suitable for a given task?) became increasingly hard to solve automatically.
In a perfect world, this problem could have been modeled and solved as supervised learning: classifying each task to the most suitable agent.
In reality, conditions were far more complex due to:
For every task (T), the right agent (A) had to be selected across all domains (D). The size of the problem can be expressed as T × A × D, a combinatorial search space that quickly grows beyond what manual delegation can handle.
Meanwhile, business needs demanded a solution that was flexible, reliable, and real-time across diverse domains.
Before testing solutions, we first focused on defining how to evaluate them.
Our evaluation aimed to answer:
Following Test-Driven Development principles, we created evaluation datasets.
Initially, we used synthetic datasets across five core domains:
Later, we validated the approach with real customer datasets.
Each agent was characterized by:
To start, we focused on name, description, and sample user requests, creating a synthetic dataset consisting of 70 agents, each assigned five specific tasks. For example:
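(The sketch below is purely illustrative: the agent name, description, sample requests, and tasks are hypothetical, not taken from the actual dataset.)

```python
# Hypothetical entry from the synthetic dataset: one agent with its
# description, sample user requests, and five assigned tasks.
agent_example = {
    "name": "Invoice Triage Agent",
    "description": "Extracts key fields from incoming invoices and routes them "
                   "to the correct approval workflow.",
    "sample_user_requests": [
        "Please process the attached invoice from Acme GmbH.",
        "Which cost center should this invoice be booked to?",
    ],
    "tasks": [
        "Extract invoice number, amount, and due date",
        "Match the invoice to an open purchase order",
        "Route the invoice to the responsible approver",
        "Flag invoices above the approval threshold",
        "Archive the processed invoice",
    ],
}
```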
Ground truth datasets were subsequently generated for evaluation purposes:
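Each ground-truth record pairs a user request with the agent expected to handle it. A hypothetical record (matching the illustrative agent above) could look like this:

```python
# Hypothetical ground-truth record used for evaluation:
# a user request plus the agent that should be selected (the label).
ground_truth_example = {
    "request": "Please process the attached invoice from Acme GmbH.",
    "expected_agent": "Invoice Triage Agent",
}
```

Accuracy is then measured as how often the expected agent appears as the Top-1 prediction, or within the Top-3.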
Now that we had created the evaluation sets, we started testing different solution approaches.
As a baseline, we designed a simple LLM-driven classifier built around a structured prompt that lists the candidate agents.
Using structured outputs, we could then determine the top 3 agents.
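A minimal sketch of this baseline, assuming the OpenAI Python SDK with Pydantic-based structured outputs (the prompt wording, model choice, and field names are illustrative, not our exact production setup):

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Delegation(BaseModel):
    # Top 3 agent names, ordered from most to least suitable.
    top_agents: list[str]
    reasoning: str

def build_prompt(task: str, agents: list[dict]) -> str:
    # List every candidate agent, then state the task to delegate.
    agent_lines = "\n".join(f"- {a['name']}: {a['description']}" for a in agents)
    return (
        "You are a delegation assistant. Given the task below, select the three "
        "most suitable agents from the list, ordered by suitability.\n\n"
        f"Available agents:\n{agent_lines}\n\n"
        f"Task: {task}"
    )

def classify(task: str, agents: list[dict]) -> Delegation:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",          # illustrative model choice
        messages=[{"role": "user", "content": build_prompt(task, agents)}],
        response_format=Delegation,   # parsed directly into the Pydantic model
    )
    return completion.choices[0].message.parsed
```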
This approach initially produced good results:
However, when agents were shuffled, performance declined significantly:
This drop in performance is due to the well-known "lost in the middle" effect, where the model pays less attention to information buried in the middle of a long prompt.
Learning #1: The order of delegates significantly affects classification accuracy. Using a reranker to present the LLM with only the most relevant agents restores accuracy and improves efficiency by reducing token usage:
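A sketch of this shortlisting step, here using embedding similarity to pre-filter candidates before the LLM call (a dedicated cross-encoder reranker could be swapped in; the embedding model and cut-off k are illustrative):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",   # illustrative embedding model
        input=texts,
    )
    return np.array([item.embedding for item in response.data])

def shortlist_agents(task: str, agents: list[dict], k: int = 10) -> list[dict]:
    # Rank agents by cosine similarity between the task and each agent's
    # description plus sample requests, and keep only the top k candidates.
    agent_texts = [
        a["description"] + " " + " ".join(a.get("sample_user_requests", []))
        for a in agents
    ]
    agent_vecs = embed(agent_texts)
    task_vec = embed([task])[0]
    sims = agent_vecs @ task_vec / (
        np.linalg.norm(agent_vecs, axis=1) * np.linalg.norm(task_vec)
    )
    top_idx = np.argsort(sims)[::-1][:k]
    return [agents[i] for i in top_idx]
```

Only the shortlisted agents are then passed to the LLM classifier, which keeps the prompt short and keeps the most relevant candidates out of the problematic middle of the context.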
To improve the baseline approach, the first step was to include user request examples directly in the context of the delegator.
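Concretely, this can be as simple as appending a few sample requests to each agent's entry in the prompt (a sketch, extending the illustrative build_prompt above):

```python
def build_prompt_with_examples(task: str, agents: list[dict]) -> str:
    # Same structure as before, but each agent entry now carries a few
    # sample user requests as additional context for the delegator.
    blocks = []
    for a in agents:
        examples = "; ".join(a.get("sample_user_requests", [])[:3])
        blocks.append(
            f"- {a['name']}: {a['description']} (example requests: {examples})"
        )
    return (
        "You are a delegation assistant. Given the task below, select the three "
        "most suitable agents from the list, ordered by suitability.\n\n"
        "Available agents:\n" + "\n".join(blocks) + f"\n\nTask: {task}"
    )
```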
We then analyzed how the quality of Top-1 matches depends on the number of available agents.
This highlighted an important insight:
Scaling the number of agents without smart filtering significantly degrades matching quality.
Despite the open-ended nature of the classification problem, we achieved strong baseline results. For comparison: with roughly 70 candidate agents, random chance classification would yield only ~1% accuracy.
Learning #2: Including user request examples in the delegation context significantly boosts performance.
To further improve results, we identified several potential next steps:
During experimentation, we explored methods to measure prediction confidence.
One effective approach: using LogProbs from the model outputs.
To enable this functionality in the OpenAI SDK, one can request log probabilities directly when creating a chat completion.
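A sketch of what this can look like (the model choice and prompt are illustrative; the key parameters are logprobs and top_logprobs):

```python
import math
from openai import OpenAI

client = OpenAI()

prompt = (
    "Select the single most suitable agent for the task below. "
    "Answer with the agent name only.\n\n"
    "Agents: Invoice Triage Agent, Contract Review Agent, IT Helpdesk Agent\n"
    "Task: Please process the attached invoice from Acme GmbH."
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    logprobs=True,         # return log probabilities for each output token
    top_logprobs=5,        # also return the 5 most likely alternative tokens
)

# Inspect the alternatives considered for the first generated token and
# convert their log probabilities into percentages.
first_token = completion.choices[0].logprobs.content[0]
for candidate in first_token.top_logprobs:
    print(f"{candidate.token!r}: {math.exp(candidate.logprob):.2%}")
```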
LogProbs offer a probability distribution over choices, helping determine when a model is certain versus when it's uncertain.
Examples from our experiments:
The insights gained from LogProbs helped refine when manual review might be necessary — particularly when probability scores between agents were close.
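One simple way to act on this signal is to compare the probabilities of the top two candidate agents and defer to a human when the margin is small (the 0.2 threshold below is illustrative):

```python
def needs_manual_review(agent_probs: dict[str, float], margin: float = 0.2) -> bool:
    # agent_probs maps candidate agent names to their probabilities
    # (log probabilities exponentiated, as above). If the best and
    # second-best candidates are too close, route the task to a human.
    top_two = sorted(agent_probs.values(), reverse=True)[:2]
    if len(top_two) < 2:
        return False
    return (top_two[0] - top_two[1]) < margin

print(needs_manual_review({"Invoice Triage Agent": 0.91, "Contract Review Agent": 0.05}))  # False
print(needs_manual_review({"Invoice Triage Agent": 0.48, "Contract Review Agent": 0.41}))  # True
```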
Learning #3: Using LogProbs provides valuable certainty signals and improves evaluation of model confidence.
Learning #4: Implementing a rubric-based confidence scoring system further enhanced clarity and interpretability, offering a more structured way to assess prediction quality:
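As an illustration, such a rubric can map the top candidate's probability and its lead over the runner-up to a small set of confidence bands (the thresholds and labels below are hypothetical, not the exact rubric we use):

```python
def confidence_band(top_prob: float, margin: float) -> str:
    # Hypothetical rubric: combine the top candidate's probability with its
    # lead over the runner-up to produce a human-readable confidence label.
    if top_prob >= 0.8 and margin >= 0.3:
        return "HIGH - delegate automatically"
    if top_prob >= 0.5 and margin >= 0.1:
        return "MEDIUM - delegate, but surface alternatives to the user"
    return "LOW - ask the user to pick the agent"

print(confidence_band(0.92, 0.85))  # HIGH - delegate automatically
print(confidence_band(0.55, 0.12))  # MEDIUM - delegate, but surface alternatives to the user
print(confidence_band(0.40, 0.05))  # LOW - ask the user to pick the agent
```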
Our experiments confirmed the strong potential of LLM-based solutions for agent delegation.
Key practical techniques included:
- Reranking, so the delegator only sees the most relevant agents
- Including sample user requests in the delegation context
- Using LogProbs as a certainty signal
- Applying a rubric-based confidence score for interpretability
While this solution already significantly improves the user experience on the Interloom platform, its impact reaches beyond that.
In multi-agent systems, an orchestration agent must reliably allocate and coordinate tasks across multiple agents.
Reliability at this stage is critical: it directly determines the overall quality and efficiency of the system and is key to operationalizing the advantages of AI agents.
We have successfully deployed a similar solution in production, proving it to be: