Stop Treating Tool Selection Like Document RAG: Start With BM25

If your agent has dozens or thousands of callable tools, picking the right one isn’t a vibes-based decision. It’s retrieval. And the default playbook many teams reach for (semantic embeddings first, everything else later) may be pointing in the wrong direction. At AI Tech Inspire, we spotted a pattern worth discussing: for large tool catalogs, a simple lexical baseline like BM25 can beat semantic matching for primary tool selection. Here’s why that matters, and how to put it to work.

What the data says (quick bullets)

Tool selection at scale is a retrieval task; you can’t pass every tool definition to the model each turn without tanking token budgets and accuracy.
As the catalog grows, selection accuracy drops and token cost of carrying definitions dominates; retrieving a subset per request becomes necessary.
Document-style defaults (semantic embeddings as primary) underperformed a lexical baseline in testing for tool selection.
Reason: tool descriptions are short and structurally similar; the deciding signal is often a single keyword or schema property (e.g., repo_id, channel), which cosine similarity tends to blur.
BM25 over a flat-text projection of a tool’s name, description, and a walk of its input/output schema preserved the discriminating keywords and ran fully in-process.
Tools live in a smaller, more structured space than documents; the signal is “keyword-shaped,” so lexical should be primary; semantic/hybrid can be an optional layer for fuzzier queries or very large catalogs.
An open benchmark scores this on a 43,000-tool corpus with labeled relevance: github.com/ratel-ai/ratel. The setup is comparable to ToolRet-style evaluations.
Open question: at 200 tools and above, where (and when) do semantic or hybrid approaches reliably outperform BM25 on primary selection?

Tool selection is retrieval, not prompt decoration

In a function-calling loop—whether via GPT function calling, server-side orchestrators, or MCP-style interfaces—you can’t expose every tool on every turn once the catalog gets big. Even if you could afford the tokens, selection accuracy usually suffers as the option set balloons. The pragmatic fix is to retrieve a small candidate set that the model can rank or call.

That shift reframes the problem: it’s not about beautifully formatted tool cards; it’s about scoring and ranking relevance. And when the text you’re ranking is short, repetitive, and schema-driven, different retrieval dynamics kick in than what teams see in long-form document RAG.

Why semantic can underperform for tools

Document RAG defaults assume paragraph-length chunks with rich, redundant phrasing. Embeddings shine when meaning is distributed across sentences and synonyms. Tool metadata is the opposite: small, terse, and repetitive by design. Descriptions like “list the open issues” and “list the channel messages” share most tokens; one decisive noun (“issues” vs “channel”) changes the outcome. Cosine similarity over near-identical strings tends to smear that difference.

Meanwhile, schema details often carry the decisive signal. If a tool takes repo_id or channel as a required parameter, that single property name can matter more than any natural-language gloss. Embedding models trained on sentence-level semantics don’t necessarily weight these structural hints as strongly as a keyword-focused ranker would.

Tools live in a smaller, more structured space. The discriminating signal is keyword-shaped—and that’s exactly what BM25 is good at.
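
For reference, the standard BM25 score shows why a single rare keyword can decide the ranking. The notation below is the usual textbook form (with the smoothed IDF that Lucene-style engines use), not something taken from the benchmark itself:

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\,
  \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
```

Here f(t, d) is how often term t occurs in tool text d, |d| is the length of that text, avgdl is the average length across the catalog, N is the number of tools, and n_t is the number of tools containing t. The parameters k_1 and b are tuning knobs, commonly set near 1.2 and 0.75. A verb like “list” that appears in most tools has n_t close to N, so its IDF is near zero, while a noun or schema name that appears in only a few tools gets a large IDF and dominates the sum.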

A BM25-first recipe that works

The tested approach that stood out used BM25 over a flattened representation of each tool:

Concatenate name + description + a shallow “walk” over input and output schema fields (including property names and types).
Index this flat text and rank candidate tools against the user’s immediate instruction (“list open issues for repo X”, “fetch last 20 messages from channel Y”).
Return top-k tools to the model for final selection or direct calling.

This setup keeps the decisive tokens crisp, runs locally (no embedding service required), and is easy to tune. You can implement it with Lucene-based stacks, tantivy, Whoosh, or any search library that exposes BM25 and, optionally, field boosts (BM25F-style) to weight schema terms more heavily than descriptive prose.
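
A minimal in-process sketch of the recipe, assuming each tool is a dict with `name`, `description`, and JSON-Schema-like `input_schema` / `output_schema` keys. Those key names, the `BM25Index` class, and the three sample tools are illustrative assumptions, not the benchmark’s code:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Keep snake_case names whole (repo_id) and also index their parts (repo, id).
    tokens = []
    for tok in re.findall(r"[a-z0-9_]+", text.lower()):
        tokens.append(tok)
        if "_" in tok:
            tokens.extend(part for part in tok.split("_") if part)
    return tokens

def flatten(tool):
    # Shallow projection: name, description, then property names and types.
    parts = [tool["name"], tool["description"]]
    for section in ("input_schema", "output_schema"):
        for prop, spec in tool.get(section, {}).get("properties", {}).items():
            parts.append(f"{prop} {spec.get('type', '')}")
    return " ".join(parts)

class BM25Index:
    def __init__(self, tools, k1=1.2, b=0.75):
        self.tools, self.k1, self.b = tools, k1, b
        self.docs = [Counter(tokenize(flatten(t))) for t in tools]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in d)  # document frequency
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def search(self, query, k=5):
        terms = set(tokenize(query))
        scored = []
        for tool, doc, length in zip(self.tools, self.docs, self.lengths):
            norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
            score = sum(
                self.idf[t] * doc[t] * (self.k1 + 1) / (doc[t] + norm)
                for t in terms if t in doc
            )
            scored.append((score, tool["name"]))
        scored.sort(key=lambda pair: -pair[0])
        return scored[:k]

# Hypothetical three-tool catalog for illustration.
TOOLS = [
    {"name": "list_issues", "description": "List the open issues",
     "input_schema": {"properties": {"repo_id": {"type": "string"}}}},
    {"name": "list_channel_messages", "description": "List the channel messages",
     "input_schema": {"properties": {"channel": {"type": "string"},
                                     "limit": {"type": "integer"}}}},
    {"name": "create_issue", "description": "Create a new issue",
     "input_schema": {"properties": {"repo_id": {"type": "string"},
                                     "title": {"type": "string"}}}},
]

index = BM25Index(TOOLS)
print(index.search("list open issues for repo X", k=2))
```

There is no stemming here, so “issue” and “issues” are different terms; a production setup would likely lean on the analyzer of whichever search library it uses.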

When to add a semantic or hybrid layer

None of this argues that embeddings are useless, only that they shouldn’t be the default primary retriever for tool selection. Consider a semantic or hybrid layer when:

Your catalog is very large (hundreds to tens of thousands) and keyword collisions rise.
Users express fuzzy intent (“prepare this dataset for training”) where multiple tools could be plausible and synonyms matter.
You want to rerank a BM25 shortlist, not replace it—semantic rerankers can help untie close lexical scores.

A practical hybrid: use BM25 to pull the top 20–50 candidates, then apply a lightweight embedding rerank or a short GPT-style judge pass that inspects the candidate schema against the request. Keep the primary recall lexical; let semantic sharpen ambiguous cases.
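
One way to sketch that rerank step, assuming the lexical stage hands over `(bm25_score, tool_name, flat_text)` tuples. The `toy_similarity` function is only a stand-in for a real embedding model (it uses character-trigram overlap), and the blend weight `alpha` and the sample shortlist are made up for illustration:

```python
def char_trigrams(text):
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def toy_similarity(query, text):
    # Stand-in for an embedding model: Jaccard overlap of character trigrams.
    a, b = char_trigrams(query), char_trigrams(text)
    return len(a & b) / len(a | b) if a | b else 0.0

def rerank(query, shortlist, similarity=toy_similarity, alpha=0.7):
    """Blend normalized BM25 scores with a similarity score over the shortlist only."""
    if not shortlist:
        return []
    top = max(score for score, _, _ in shortlist) or 1.0  # avoid dividing by zero
    blended = [
        (alpha * (score / top) + (1 - alpha) * similarity(query, text), name)
        for score, name, text in shortlist
    ]
    return sorted(blended, key=lambda pair: -pair[0])

# Hypothetical shortlist in which the first two tools tie lexically.
shortlist = [
    (2.0, "normalize_dataset", "normalize dataset columns for model training"),
    (2.0, "delete_dataset", "delete dataset permanently"),
    (0.4, "list_channels", "list channels in workspace"),
]
for score, name in rerank("prepare this dataset for training", shortlist):
    print(f"{score:.3f}  {name}")
```

Because the lexical score keeps most of the weight, the second signal mainly separates near-ties. Swapping `toy_similarity` for a call into an actual embedding model or an LLM judge leaves the rest of the function unchanged.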

How to try this in your stack (quick start)

Flatten metadata: emit a per-tool string including name, description, and all input/output fields (names, types, requirements). Example: Tool: list_channel_messages | Params: channel (string, required), limit (int).
Index with BM25: use Elasticsearch/OpenSearch, Lucene, tantivy, or a simple in-process library. If supported, boost schema fields above descriptions.
Query construction: pass the raw user instruction; optionally append detected entities (repo_id=..., channel=...) from lightweight NER or regex.
Candidate size: start with k=10–30 for typical catalogs; grow k with catalog size. Feed these to the model for ranking/selection.
Optional rerank: apply a small embedding reranker or a short LLM judge over the BM25 shortlist for fuzzy queries.
LLM prompt: show only the final shortlist plus minimal schemas. Avoid dumping the whole catalog each turn to save tokens and reduce confusion.
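
The first and third steps above can be sketched as follows. The tool layout (`input_schema` / `output_schema` with `properties` and `required`) and the two regex patterns are assumptions chosen for illustration; real entity detection would need patterns tuned to your own identifiers:

```python
import re

def describe(name, spec, required):
    flag = ", required" if name in required else ""
    return f"{name} ({spec.get('type', 'any')}{flag})"

def flatten_tool(tool):
    """Emit one search string: name, description, then input and output fields."""
    parts = [f"Tool: {tool['name']}", tool.get("description", "")]
    for label, key in (("Params", "input_schema"), ("Returns", "output_schema")):
        schema = tool.get(key, {})
        required = set(schema.get("required", []))
        fields = [describe(n, s, required)
                  for n, s in schema.get("properties", {}).items()]
        if fields:
            parts.append(f"{label}: " + ", ".join(fields))
    return " | ".join(p for p in parts if p)

# Hypothetical patterns: "owner/name" for repositories, "#name" for channels.
ENTITY_PATTERNS = {
    "repo_id": re.compile(r"\b([\w.-]+/[\w.-]+)\b"),
    "channel": re.compile(r"#([\w-]+)"),
}

def build_query(instruction):
    """Append field=value hints for any entities detected in the instruction."""
    extras = []
    for field, pattern in ENTITY_PATTERNS.items():
        match = pattern.search(instruction)
        if match:
            extras.append(f"{field}={match.group(1)}")
    return " ".join([instruction] + extras)

tool = {
    "name": "list_channel_messages",
    "input_schema": {
        "properties": {"channel": {"type": "string"}, "limit": {"type": "integer"}},
        "required": ["channel"],
    },
}
print(flatten_tool(tool))
print(build_query("fetch last 20 messages from #general"))
```

Appending `channel=general` puts the schema field name itself into the query, which is the token a lexical ranker can match against the flattened tool text.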

For teams with tools that touch Hugging Face resources—where fields like repo_id or dataset names dominate—lexical matching tends to be especially strong because the key terms are explicit and unambiguous.

What to measure (and why it matters)

Precision@1 / Recall@k: Does the correct tool land at rank 1? Is it at least in the shortlist? Track both.
Latency budget: In-process BM25 is fast and predictable. Embedding round trips and semantic reranking add anywhere from milliseconds to seconds, so plan for it.
Token spend: How many tokens per turn are spent on tool context before any “real” work? Lexical-first pipelines often cut this dramatically.
Failure analysis: When the wrong tool wins, is it because of overloaded tokens (“list”, “update”), fuzzy intent, or missing schema keywords? This guides boosts and when to enable semantic rerank.
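
The first of those metrics takes only a few lines to compute. This sketch assumes `ranked` maps a query id to the retriever’s ordered tool names and `relevant` maps the same id to a set of labeled-correct tools; both structures and the sample data are illustrative:

```python
def precision_at_1(ranked, relevant):
    """Fraction of queries whose top-ranked tool is a labeled-relevant one."""
    hits = sum(
        1 for qid, tools in ranked.items()
        if tools and tools[0] in relevant[qid]
    )
    return hits / len(ranked)

def recall_at_k(ranked, relevant, k):
    """Mean fraction of each query's relevant tools that appear in the top k."""
    total = 0.0
    for qid, tools in ranked.items():
        gold = relevant[qid]
        total += len(gold & set(tools[:k])) / len(gold)
    return total / len(ranked)

# Hypothetical retrieval output and labels for two queries.
ranked = {
    "q1": ["list_issues", "create_issue"],
    "q2": ["list_issues", "list_channel_messages"],
}
relevant = {
    "q1": {"list_issues"},
    "q2": {"list_channel_messages"},
}
print(precision_at_1(ranked, relevant))   # 0.5
print(recall_at_k(ranked, relevant, k=2))  # 1.0
```

A gap like this one (low Precision@1, high Recall@k) suggests the shortlist is fine and the ordering is the problem, which is the case where a rerank pass is most likely to help.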

Even for advanced orchestration—pre/post hooks, multi-step planners, or function trees—the retrieval layer remains the choke point. Better primary retrieval means fewer planner retries, fewer hallucinated calls, and less prompt bloat.

Edge cases and tuning tips

Overlapping verbs: Many tools start with the same verbs (list, get, update). Favor boosting nouns (objects, resource types) and schema fields over generic action words.
Short names, long schemas: If names are cute but uninformative, the schema walk does the heavy lifting—ensure it’s indexed.
Query expansion: Light heuristics help, such as mapping “DM” → “direct message” → channel and “repo” → repository/repo_id.
k selection: If the model struggles to pick from too many similar tools, shrink k or add a semantic rerank to separate near-ties.
Dynamic catalogs: Use incremental indexing; hot-reload indexes when tools are added or updated to keep retrieval fresh.
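
The verb and expansion tips can be combined into a small query pre-processing step. The expansion table mirrors the examples above; the weights (0.3 for generic verbs, 0.5 for expansions) are arbitrary placeholders to tune against your own failure analysis:

```python
import re

# Expansion table taken from the examples in the text; extend as needed.
EXPANSIONS = {
    "dm": ["direct", "message", "channel"],
    "repo": ["repository", "repo_id"],
}
GENERIC_VERBS = {"list", "get", "update"}

def expand_query(query):
    """Return (term, weight) pairs: originals, expansions, down-weighted verbs."""
    weighted = []
    for term in re.findall(r"[a-z0-9_]+", query.lower()):
        weight = 0.3 if term in GENERIC_VERBS else 1.0
        weighted.append((term, weight))
        weighted.extend((extra, 0.5) for extra in EXPANSIONS.get(term, []))
    return weighted

print(expand_query("get DM history"))
```

Each weight would multiply that term’s contribution in the BM25 sum. Engines that support per-term query boosts can take these pairs more or less directly.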

The benchmark—and the open question

There’s an open benchmark over a 43,000-tool corpus with labeled relevance at github.com/ratel-ai/ratel. Results reported there align with the core claim: lexical-first wins as a primary selector for tool catalogs, with semantic reserved for fuzzy intents or post-BM25 reranking.

The interesting cliff is somewhere beyond 200 tools—where exactly does semantic start to pay rent as a primary signal, not just a reranker? Catalog composition, naming discipline, and schema style likely shift the crossover point. If your tools have rich, human-like descriptions and sparse schemas, embeddings may climb sooner. If they’re terse and schema-heavy, BM25 will likely keep the lead longer.

Bottom line

Don’t import document RAG defaults into tool selection without testing. For most agent stacks today, the fastest, cheapest, and often most accurate starting point is lexical: BM25 over a schema-aware projection, with optional semantic rerank for the edge cases. It saves tokens, reduces confusion, and gets you closer to the right call on the first try. And if you’re already running semantic-first, this is an easy A/B: swap in BM25 for primary retrieval and see whether your Precision@1 and token spend improve.

At AI Tech Inspire, the takeaway is practical: treat tool selection like the structured retrieval problem it is. Start lexical, measure ruthlessly, and add semantic only where it clearly earns its keep.
