Small Agents, Big Results: Tool Use Beats Pure Scale

This study shows that a small AI model with tools can beat a big AI model without them. And when models "think" too much, they forget the rules: skipping tools, spiraling into infinite loops, and outputting wrong answers.


MantraVid Admin

March 18, 2026


Small Agents, Big Results: Why Tool-Using AI Beats Pure Scale

Here's a question I'd been wondering about that AI researchers have now answered: if you want a smarter system, should you build a bigger brain or give a smaller one better tools?

A new study from the University of Amsterdam tested this. And it's not what most people expected.

The researchers pitted small AI models (4 to 32 billion parameters) against each other in different configurations. Some worked alone. Others collaborated in teams with access to search engines, code execution, and memory tools. The question was simple: can a group of small, tool-using agents beat one big model doing everything itself?

The short answer: yes. Decisively.

What They Actually Tested

The team used the GAIA benchmark, which tests AI systems on real-world tasks that require planning, fact finding, calculations, and keeping track of complex information. Think questions like "What's the minimum number of page clicks to get from the Wikipedia page on Lord of the Rings to the page on Game of Thrones?" or "Calculate the population difference between two penguin species using 2018 data, accounting for breeding pairs."

They tested three variables:

  • Model size: From 4 billion parameters up to 32 billion

  • Tool access: Search, code execution, and a "mind map" for remembering intermediate findings

  • Thinking style: No explicit reasoning, planning only, or full step-by-step thinking throughout
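To get a feel for the experimental grid, the three variables multiply out to 30 configurations. This is just an illustrative enumeration; the names are mine, not the paper's code.

```python
from itertools import product

# Illustrative reconstruction of the study's configuration grid.
MODEL_SIZES = ["4B", "4B-Instruct", "8B", "14B", "32B"]
TOOL_ACCESS = [False, True]           # search + code execution + mind map
THINKING = ["none", "planner_only", "full"]

configs = list(product(MODEL_SIZES, TOOL_ACCESS, THINKING))
print(len(configs))  # 5 models x 2 tool settings x 3 thinking modes = 30
```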

The Two Big Takeaways

1. Tools beat size

A 4 billion parameter model with tools outperformed a 32 billion parameter model without them. That's an 8x size difference overcome entirely by giving the smaller model better equipment.

The 4B model with tools scored 18.2% accuracy on the benchmark. The 32B model without tools? 12.7%.

This matters because smaller models are cheaper to run, faster, and can actually fit on consumer hardware. You don't need a data center. You need better architecture.

2. Thinking is a double-edged sword

Here's where it gets interesting. Adding explicit reasoning (asking the model to "think step by step") helped in some situations and hurt in others.

When models didn't have tools, thinking consistently improved their performance. It helped them compensate for their limitations.

But when models already had tools, thinking often made things worse. The 32B model actually performed best without any explicit reasoning when it had tool access. Its best score with thinking disabled was 25.5%. With thinking enabled? It dropped to 23.0%.

Model by Model Breakdown: Who Won and Why

The researchers tested five configurations: 4B, 4B Instruct, 8B, 14B, and 32B. Each ran through multiple thinking modes (none, planner only, full) with and without tools. Here's what the numbers actually show.

The 4B Models: Tools Are Everything

The base 4B model without tools scored a dismal 6.1%. With tools but no thinking, it jumped to 13.3%. That's more than doubling performance just by giving it search and code execution.

The 4B Instruct version (the same size, but fine-tuned to follow instructions) scored higher (9.7% without tools) and peaked at 18.2% with tools and planner only thinking.

Key insight: For small models, instruction fine-tuning matters almost as much as tools. The instruct version outperformed the base model in every configuration. But here's the weird part: adding full thinking to the 4B Instruct actually hurt performance, dropping it from 18.2% to 15.8%. The small model couldn't handle reasoning and tool coordination simultaneously.

The 8B Model: The Thinking Sweet Spot

The 8B model tells a different story. Without tools, it was stuck around 6% regardless of thinking. With tools, it climbed steadily: no thinking (10.3%), planner only (12.7%), full thinking (16.4%).

This is the only model where full thinking consistently helped. The researchers suggest 8B hits a sweet spot: large enough to coordinate reasoning and tools, small enough that it still needs the reasoning crutch. At 14B the gains from thinking shrink, and the 32B model actually lost performance when thinking was added.

The 14B Model: The Diminishing Returns Begin

At 14B, the pattern shifts. With tools and no thinking: 17.6%. With planner thinking: 19.4%. With full thinking: 20.6%. The gains from thinking are shrinking.

Look specifically at Level 3 tasks, the hardest ones, which require long-horizon coordination. The 14B scored 0% with full thinking on Level 3. It completely failed the hardest problems when it tried to reason step by step. With planner only thinking, it managed 7.7%. Sometimes less thinking is more.

The 32B Model: The Contradiction

Here's the result that should make researchers pause. The 32B model's best score: 25.5% with tools and no thinking at all.

Adding planner thinking dropped it to 20.6%. Full thinking recovered some ground at 23.0%, but still fell short of no thinking at all. The largest model they tested, with the most capacity for reasoning, performed best when it just... acted.

On Level 3 tasks, the drop was catastrophic: from 11.5% with no thinking to 3.9% with full thinking. The model that should have been most capable of reasoning was actually disabled by it.

| Model | Best Config | Score | Key Takeaway |
| --- | --- | --- | --- |
| 4B Instruct | Tools + planner thinking | 18.2% | Instruction tuning is critical |
| 8B | Tools + full thinking | 16.4% | Sweet spot for reasoning |
| 14B | Tools + full thinking | 20.6% | Gains from thinking shrink |
| 32B | Tools + no thinking | 25.5% | Reasoning becomes harmful |

Why Thinking Sometimes Backfires

The paper documents three specific ways that thinking can sabotage an AI with tools:

Skipping the right tool. A model that starts "thinking" might decide it can handle a calculation itself rather than calling the code tool. It can't. It gets the answer wrong.

A 4B model with tools was asked about ASEAN countries with the furthest capitals. Without thinking, it called search and code, got coordinates, calculated distances, and answered correctly. With thinking, it decided search alone was enough and guessed wrong.

Getting stuck in loops. One 32B run with thinking enabled made 15 search queries and never produced an answer at all. The same model without thinking solved it in 3 queries.

Format drift. A thinking model started producing nicely formatted markdown headings instead of the three letter country code the question asked for. It failed a task it could have solved.

The pattern is consistent: thinking helps with planning and constraint checking, but when it messes with tool coordination, everything falls apart.
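Guards against two of these failure modes fit in a few lines. This is a hypothetical wrapper, not anything from the paper: it caps the tool-call budget (the runaway-loop case) and rejects answers that drift from the required format.

```python
import re

MAX_TOOL_CALLS = 8  # budget chosen arbitrarily for illustration

def run_with_guards(step_fn, is_valid, max_calls=MAX_TOOL_CALLS):
    """Call step_fn() until it yields an answer that passes is_valid,
    or give up after max_calls (the runaway-loop guard)."""
    for _ in range(max_calls):
        answer = step_fn()
        if answer is not None and is_valid(answer):
            return answer
        # None, or a format-drifted answer: let the agent try again.
    return None  # budget exhausted -- the 15-query loop case above

# Demo: the question wants a three-letter country code; the stubbed model
# first returns nothing, then drifts into a markdown heading, then answers.
replies = iter([None, "## Indonesia", "IDN"])
result = run_with_guards(lambda: next(replies),
                         lambda a: re.fullmatch(r"[A-Z]{3}", a) is not None)
print(result)  # IDN
```

In a real system `is_valid` would encode whatever output format the task demands; the point is that the check lives outside the model, where thinking can't talk it out of compliance.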

When Thinking Actually Helps

An 8B model was asked for the minimum clicks between Wikipedia pages. Without thinking, it guessed "3" and was wrong. With thinking, it called search, code, and mind map tools, coordinated them, and got "2" correct.

The difference? In the first case, thinking made the model overconfident in its own knowledge. In the second, it helped the model decompose a problem it couldn't solve by guessing.

What Hardware Did They Actually Use?

The paper mentions this almost in passing, but it's crucial for anyone wanting to replicate or build on this work.

The constraint: GPU memory. They were running on academic hardware, not a data center cluster.

The solution: Instead of loading separate models for each agent role (planner, searcher, coder, memory), they used a single shared model instance with locking. The same physical model sat in memory, and they swapped its role by changing the system prompt. Only one agent could use it at a time.

This means the "multi agent collaboration" they studied wasn't parallel processing. It was sequential role switching. The planner would think, hand off to search, wait, get results, think again, hand off to code, wait, and so on.
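The shared-instance pattern can be sketched as follows. Everything here is illustrative: the role prompts, the `call_model` backend, and the class name are my assumptions, not the authors' implementation.

```python
import threading

# Minimal sketch of a single shared model serialized by a lock, with the
# agent role selected by swapping the system prompt.
ROLE_PROMPTS = {
    "planner": "You are the planner. Break the task into steps.",
    "searcher": "You are the searcher. Answer with search queries.",
    "coder": "You are the coder. Answer with runnable code.",
}

class SharedModel:
    def __init__(self, call_model):
        self._call = call_model          # e.g. a local LLM inference function
        self._lock = threading.Lock()    # only one agent role at a time

    def run_as(self, role, user_msg):
        with self._lock:                 # sequential role switching, not parallelism
            return self._call(ROLE_PROMPTS[role], user_msg)

# Demo with a stub backend that just echoes the role sentence it was given.
model = SharedModel(lambda system, user: system.split(".")[0])
print(model.run_as("coder", "compute penguin populations"))  # You are the coder
```

The lock is what makes "multi agent" really mean "one model taking turns": the planner's hand-off to search is a prompt swap, not a second process.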

The hardware itself: They don't list exact GPUs, but the memory constraints suggest consumer or workstation-class cards, likely NVIDIA A6000s (48GB) or possibly A100s (40/80GB) if they had access. The key is they couldn't load multiple 32B models simultaneously. That single-instance constraint shaped the entire experiment.

Why this matters: If you're building agentic systems on consumer hardware (RTX 3090/4090), this is your reality. You can't load five different 70B models. You have to swap. The paper's findings about thinking causing coordination failures might be exacerbated by this constraint: the model has to context-switch constantly, and reasoning might make those switches harder.

The Not So Obvious Conclusion: Reasoning Is a Crutch, Not a Feature

Here's what the paper demonstrates that isn't stated directly.

Reasoning is what models do when they don't have better options.

Look at the progression:

  • Without tools, thinking always helps. Models use reasoning to compensate for lack of external capabilities.

  • With tools, thinking becomes optional. Models trade reasoning for action.

  • At 32B, thinking becomes harmful. The model can coordinate tools internally without verbalizing the steps.

This suggests that "chain of thought" and explicit reasoning are workarounds for models that aren't large enough or well trained enough to do implicit coordination. They're training wheels.

The 32B model with tools and no thinking isn't "dumber" than the thinking version. It's more integrated. It processes, plans, and executes in a single fluid operation rather than breaking it into discrete, verbalized steps that create opportunities for error.

What researchers should look at next:

Dynamic reasoning policies.

The paper mentions this briefly, but it's the logical next step.

Imagine a system that:

  • Uses no explicit reasoning for routine tool calls

  • Activates planner only thinking when it detects task decomposition is needed

  • Never enables full thinking across all agents because the paper shows it's almost always worse than planner only
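A policy like the one sketched above is almost trivially small. The task flags and function name here are hypothetical, just to make the idea concrete:

```python
# Hypothetical dynamic reasoning policy: routine tool calls get no explicit
# reasoning, decomposition triggers planner-only thinking, and full thinking
# is never enabled globally.

def choose_thinking_mode(task):
    """task is a dict of illustrative flags, not an API from the paper."""
    if task.get("needs_decomposition"):
        return "planner_only"
    return "none"  # default for routine tool calls; "full" is never returned

print(choose_thinking_mode({"needs_decomposition": True}))  # planner_only
print(choose_thinking_mode({"tool": "search"}))             # none
```

The hard research question is the `needs_decomposition` detector itself, which the paper doesn't provide.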

Or even more interesting: train models to internalize reasoning patterns so they don't need to verbalize them. The 32B model's best performance came when it had tools and didn't think out loud. That's not an accident. It's a hint that the ideal agentic model might reason silently, the way humans execute well practiced skills without narrating every step.

The other open question: Would dedicated smaller models outperform the shared instance approach? If you could load a 4B planner, a 4B searcher, and a 4B coder simultaneously, letting them work in parallel rather than sequentially, would that beat a 32B model swapping roles? The paper couldn't test this due to hardware limits. Someone with a bigger cluster should.

What This Means for Building AI Systems

If you're designing an agentic system (one where AI models take actions, not just generate text), this research suggests some practical rules:

Give small models tools before you give them more parameters. A 4B model with search and code execution is more capable than a 32B model without them. Scale is expensive. Tools are cheap.

Be selective about reasoning. Don't turn on "think step by step" everywhere. Use it for planning and constraint checking. Turn it off during execution. The models that performed best used thinking only for the planner agent, not for every tool call.

Watch for coordination failures. When your model starts making more tool calls but getting worse answers, you have a coordination problem, not a capability problem. It knows what tools exist. It's just using them badly.
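One hedged way to operationalize that watch, with made-up thresholds (the paper's 15-query loop versus the 3-query success suggests the shape of the heuristic):

```python
# Hypothetical monitor: flag runs that burn far more tool calls than a
# known-good baseline, or that never produce an answer at all.

def flag_coordination_failure(calls_used, baseline_calls, answered):
    """Flag runs that use >2x the baseline call budget or never answer."""
    return (not answered) or calls_used > 2 * baseline_calls

print(flag_coordination_failure(15, 3, answered=False))  # True: the looping 32B run
print(flag_coordination_failure(3, 3, answered=True))    # False: the healthy run
```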

The Bottom Line

The AI industry has spent the last few years in an arms race toward bigger models. This research suggests that race might be partially misguided. Architecture matters. Tools matter. How you coordinate multiple agents matters.

A 4 billion parameter model with good tools and selective thinking can outperform a model eight times its size. That's not just an efficiency win. It's a fundamental challenge to the assumption that scale is the only path forward.

The smarter system isn't always the bigger one. Sometimes it's the one with better equipment and the sense to know when to think and when to just act.

Source

Here's the original paper:

https://arxiv.org/abs/2601.11327v1
