All of the code in this article is available in the Oracle AI Developer Hub. The repository is part of Oracle’s open-source AI collection and serves as the reference implementation for everything covered here.
You can install it with pip install agent-reasoning, browse the 16 agent classes, run the TUI, or integrate it directly into an existing Ollama pipeline as a zero-change replacement client. If you find it useful, a GitHub star goes a long way.
Key Takeaways
- Small language models struggle with complex reasoning on their own, but agent-based architectures (like Tree of Thoughts or Self-Consistency) can significantly improve their performance.
- The agent-reasoning framework adds 16 research-backed reasoning strategies to any Ollama model using a simple +strategy tag, with no code changes required.
- Different strategies suit different tasks: CoT works well overall, ReAct excels with external data, and branching methods improve accuracy at the cost of speed.
- Much of modern AI progress comes from orchestration (prompting, search, control flow), not just larger models.
Generally, a 270M parameter LLM (as of today, April 2026) struggles with even basic multi-step reasoning. Ask a model like gemma3:270m to solve the classic water jug problem, and it will often return a confidently incorrect answer—much like other small language models (SLMs) of similar size and training.
However, take that same model and wrap it inside a Tree of Thoughts (ToT) agent, running a breadth-first search (BFS) with three levels and weighted branches, and it can reliably solve the puzzle. The improvement comes from the architecture: the agent distributes the reasoning process across structured exploration steps, compensating for the limitations of a single LLM call.
This is where things get interesting. Much of the progress in applied AI isn’t coming from bigger models alone, but from engineers rethinking how to orchestrate them—layering search, memory, and control flow on top of a standard LLM call to unlock new capabilities.
This is the fundamental idea behind agent-reasoning: sixteen cognitive architectures—each backed by peer-reviewed research—can be applied to any Ollama-served model via a simple +Strategy tag appended to the model name. Call gemma3:270m+tot instead of gemma3:270m, and the interceptor handles everything else.
In this article, we’ll walk through the different ways to invoke these reasoning strategies through the project.
What You’ll Learn
- How the ReasoningInterceptor intercepts model names, removes the +Strategy tag, and directs traffic to one of 16 agent classes
- How the 16 strategies divide into four families: sequential, branching, reflective, and meta, each representing a different reasoning approach and set of trade-offs
- What each major strategy accomplishes in practice, focusing on implementation rather than theory
- Which type of problem each strategy is best suited for, based on benchmark results from March 2026
The Interception Layer
Key insight: The ReasoningInterceptor is a drop-in replacement client for Ollama that parses the model name for a +Strategy tag and directs traffic to one of 16 cognitive agent classes, requiring no modifications to your existing code.
Everything relies on a single template: add +Strategy to any Ollama model name.

ReasoningInterceptor as a drop-in replacement client

The image below illustrates the entire routing process from start to finish. The interceptor acts as a middleman between your code and Ollama, removes the +Strategy tag, and sends traffic to the correct agent class.

agent_map contains over fifty-five aliases mapped to sixteen agent classes. For example, cot, chain_of_thought, and CoT all map to CotAgent, while mcts and monte_carlo map to MCTSAgent. Because the interceptor is a drop-in client for Ollama—supporting the same .generate() and .chat() APIs— existing LangChain pipelines, web UIs, and scripts can automatically gain reasoning capabilities by changing a single string in the model name.
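The routing code itself isn't reproduced here, but the alias lookup can be sketched in a few lines. This is a hypothetical minimal version: parse_model_tag and AGENT_ALIASES are illustrative names, not the library's actual API, and only a handful of the fifty-five aliases are shown.

```python
# Illustrative subset of the alias table; the real agent_map has 55+ entries.
AGENT_ALIASES = {
    "cot": "CotAgent",
    "chain_of_thought": "CotAgent",
    "tot": "ToTAgent",
    "mcts": "MCTSAgent",
    "monte_carlo": "MCTSAgent",
}

def parse_model_tag(model):
    """Split 'gemma3:270m+CoT' into ('gemma3:270m', 'CotAgent')."""
    if "+" not in model:
        return model, None  # no tag: pass through to plain Ollama
    base, _, tag = model.rpartition("+")
    return base, AGENT_ALIASES.get(tag.lower())
```

The point of normalizing the tag to lowercase is that cot, CoT, and chain_of_thought all resolve to the same agent class, so callers never have to remember a canonical spelling.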
Additionally, the interceptor can be used as a network proxy. Instead of pointing an Ollama-compatible application at http://localhost:11434, direct it to http://localhost:8080. With a model name like gemma3:270m+CoT, the gateway applies reasoning transparently.
Family 1: Sequential Strategies
Key insight: Sequential strategies process problems in a linear chain, where each step feeds into the next. In benchmarks, CoT achieved 88.7% average accuracy, compared to 81.3% for standard generation on the same model and weights.
Each of the sixteen strategies falls into one of four families. The diagram below illustrates how they are grouped.

Sequential strategies are designed for high-speed processing with minimal latency. They are ideal for problems with discrete, sequential steps.
Chain of Thought (CoT)
Paper: Wei et al. (2022), “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”
Chain of Thought (CoT) is a prompting strategy in which the model generates intermediate reasoning steps before producing a final response. As noted in the original paper: prompting a model to produce these intermediate steps can significantly improve accuracy.
For example, standard prompting on GSM8K achieves 66.7% accuracy. With CoT prompting, this increases to 73.3%— a 10% relative improvement achieved through simple prompt design alone.
The following graphic illustrates how CoT chains appear in practice: a sequence of numbered steps, each building on the previous one.

In terms of implementation within CotAgent, the query is wrapped in a structured prompt:
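The exact template isn't reproduced here; a plausible minimal sketch of such a wrapper might look like the following (wrap_cot and the prompt wording are hypothetical, not the library's actual template):

```python
def wrap_cot(query):
    # Hypothetical CoT prompt wrapper: ask for numbered intermediate
    # steps and a clearly marked final answer for easy extraction.
    return (
        "Solve the problem step by step.\n"
        "Number each reasoning step, then state the final answer "
        "on a line starting with 'Answer:'.\n\n"
        f"Problem: {query}"
    )
```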

Benchmark result for qwen3.5:9b (9.7B): CoT achieves 88.7% average accuracy across GSM8K (math), MMLU (logic), and ARC-Challenge (reasoning), compared to 81.3% for standard generation. This seven-point gain is attributable solely to prompt structure; both runs used identical weights and temperature.
Recommended usage: Math word problems; logic puzzles; any multi-step reasoning task where the individual steps are sequential and do not have branches.
Decomposed Prompting
Paper: Khot et al. (2022), “Decomposed Prompting: A Modular Approach for Solving Complex Tasks”
Decomposed prompting is an architectural module that splits large problems into smaller sub-problems. Each sub-problem is handled independently while carrying forward accumulated context from earlier steps. Once all sub-problems are processed, their outputs are synthesized into a final result. DecomposedAgent follows a three-phase process of decomposition, execution, and synthesis, propagating context throughout so that each step can build on prior results.
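The decompose-execute-synthesize flow can be sketched roughly as follows. This is an illustrative sketch, not DecomposedAgent's actual code: decomposed_answer and its prompt strings are hypothetical, and llm stands in for any callable that sends a prompt to the model.

```python
def decomposed_answer(query, llm):
    # Phase 1: decomposition - ask the model to list sub-problems.
    subproblems = llm(f"List the sub-problems of: {query}").splitlines()
    # Phase 2: execution - solve each one, carrying context forward.
    context, partials = "", []
    for sub in subproblems:
        partial = llm(f"Context so far:\n{context}\nSolve: {sub}")
        partials.append(partial)
        context += partial + "\n"
    # Phase 3: synthesis - merge partial results into one answer.
    return llm("Combine into one answer:\n" + "\n".join(partials))
```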
Recommended usage: Planning problems; trip itinerary generation; any problem where the ultimate answer consists of multiple distinguishable parts that may be individually addressed.
Note: Decomposed prompting achieved only 38.5% average accuracy in benchmark testing. This result requires context. GSM8K primarily evaluates arithmetic reasoning, where decomposing a problem like “what is 47 × 13 + 9?” introduces overhead without improving the model’s ability to compute the answer.
Decomposition is more effective for problems with genuinely separable components (trip planning, multi-section reports etc.), where each part benefits from focused attention. These strengths are not captured by the benchmark, and the results reflect that mismatch.
Least-to-Most Prompting
Paper: Zhou et al. (2022), “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models”
Least-to-most prompting is a strategy that orders sub-questions from simplest to most complex, establishing prerequisite knowledge before tackling harder steps. Unlike decomposed prompting, which generates arbitrary sub-problems, it enforces a deliberate progression where each step builds on the last. Knowledge is accumulated iteratively until the model reaches the final question.
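The distinguishing feature, easy-to-hard ordering with accumulated knowledge, can be sketched like this (least_to_most and its prompts are illustrative, not the library's code; llm is any prompt-to-string callable):

```python
def least_to_most(query, llm):
    # Ask for sub-questions explicitly ordered from easiest to hardest.
    subs = llm(f"List sub-questions from simplest to hardest for: {query}").splitlines()
    knowledge = []
    for q in subs:
        # Each answer is produced with all previously established facts.
        ans = llm("Known facts:\n" + "\n".join(knowledge) + f"\nAnswer: {q}")
        knowledge.append(f"{q} -> {ans}")
    # The final question is answered last, with the full knowledge ladder.
    return llm("Using these facts:\n" + "\n".join(knowledge) + f"\nFinal question: {query}")
```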
Recommended usage: Questions with genuine prerequisites — e.g., “what is x?” before determining “how does x relate to y?”; educational style explanation sequences (“concept ladder”); tasks that require establishing foundational concepts before addressing more complex components.
Family 2: Branching Strategies
Key insight: Branching strategies explore multiple reasoning paths simultaneously and choose the best path. ToT scored 76.7% on GSM8K math, compared to 66.7% on GSM8K math with standard generation.
More LLM calls mean higher latency, but often better answers on hard problems. Keep this trade-off in mind when using any branching strategy.
Tree of Thoughts (ToT)
Paper: Yao et al. (2023), “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”
ToT is a search-based method that explores numerous possible reasoning paths concurrently, selecting the best-performing path according to evaluation metrics such as distance traveled or the quality of intermediate solutions.
Similar to chess engines, ToT applies BFS through an expanding tree of possible solutions. The core idea is straightforward: generate multiple partial solutions, evaluate them, prune weaker candidates, and continue exploring the most promising branches.
Below is an illustration of how ToT generates and eliminates branches: green nodes represent surviving branches, while red nodes indicate those that have been eliminated. The final answer is derived from the highest scoring leaf node.
A key design decision is how branches are evaluated. Should the same model handle both generation and scoring, or should a stronger model be introduced as a judge? In these benchmarks, the same model was used for both roles, but this is an area worth experimenting with, depending on your accuracy and latency constraints.

ToTAgent makes this configurable by depth (default 3) and width (default 2 branches). At every level, the agent generates a set of candidate next steps, evaluates them using a scoring function, prunes low-scoring options, and expands the remaining candidates into the next level.
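A minimal BFS sketch of that loop follows. This is a simplified illustration, not ToTAgent's implementation: llm is assumed to propose a next step given the problem and the chain so far, and score is assumed to rate a partial chain (both names are hypothetical).

```python
def tot_search(query, llm, score, depth=3, width=2):
    # Breadth-first Tree of Thoughts: expand each surviving branch,
    # score all candidates, prune to the best `width`, and repeat.
    frontier = [""]  # partial reasoning chains
    for _ in range(depth):
        candidates = []
        for chain in frontier:
            for _ in range(width):  # propose `width` next steps per branch
                step = llm(f"Problem: {query}\nSo far: {chain}\nNext step:")
                candidates.append(chain + step + "\n")
        # Keep only the highest-scoring branches (pruning).
        frontier = sorted(candidates, key=score, reverse=True)[:width]
    return frontier[0]  # best surviving leaf
```

Note the cost visible in the structure: depth x width generation calls plus the scoring calls, which is where the latency multiplier over CoT comes from.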
ToT achieved 76.7% accuracy on GSM8K math problems, a ten-point improvement over standard generation. This performance comes at a cost: additional LLM calls are required at each step to evaluate candidate paths and their intermediate results, making it roughly 5-8x slower than equivalent CoT queries.
Recommended usage: Logic puzzles with multiple solution paths; strategic decision problems; tasks where multiple approaches can be explored and compared.
Self-Consistency (Majority Voting)
Paper: Wang et al. (2022), “Self-Consistency Improves Chain of Thought Reasoning in Language Models”.
Self-Consistency is a sampling method that generates multiple independent reasoning traces and selects a final answer through majority voting. Unlike standard prompting, it relies on sampling k diverse traces at a higher temperature to encourage variation. Each trace produces a candidate answer, and the most frequently occurring answer is selected as the final output.
The image below illustrates how both Self-Consistency and Monte Carlo Tree Search (MCTS) sample multiple reasoning paths, but differ fundamentally in how those paths are evaluated—majority voting versus UCB1-based exploration-exploitation balancing.

ConsistencyAgent uses k=5 samples at temperature of 0.7 by default. It extracts final answers using regex-based pattern matching and selects the most frequent result via counter.most_common().
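A stripped-down sketch of that voting logic is below. self_consistent_answer is a hypothetical name, and the real agent's answer extraction is presumably more robust than a single regex; here the last number in each trace is treated as its final answer.

```python
import re
from collections import Counter

def self_consistent_answer(query, llm, k=5):
    # Sample k independent reasoning traces (the real agent samples
    # at temperature 0.7 to encourage diverse paths).
    answers = []
    for _ in range(k):
        trace = llm(query)
        nums = re.findall(r"-?\d+(?:\.\d+)?", trace)
        if nums:
            answers.append(nums[-1])  # treat last number as the answer
    # Majority vote across traces.
    return Counter(answers).most_common(1)[0][0]
```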
Self-Consistency matches CoT on both MMLU (96.7%) and GSM8K (76.7%). Its advantage lies in reliability rather than raw accuracy: majority voting across independent reasoning traces reduces the risk of single-trace errors propagating to the final answer.
Recommended usage: Factual question answering; multiple-choice style questions; problems where arriving at the correct answer via diverse reasoning paths is more important than inspecting a single reasoning trace.
Family 3: Reflective Strategies
Self-Reflection
Paper: Shinn et al. (2023), “Reflexion: Language Agents with Verbal Reinforcement Learning” — arXiv:2303.11366
Self-Reflection is a draft-critique-refine loop in which the model generates an initial answer, critiques it for errors, and then revises it. The Reflexion paper showed that this iterative process can meaningfully improve output quality, even without any gradient updates.
The image below shows all three reflective strategies side by side: Self-Reflection, Debate, and Refinement Loop.

SelfReflectionAgent runs a draft-critique-refine loop for up to 5 iterations, with early termination when the critique returns “CORRECT” in under 20 characters. If the critique is satisfied on an early pass, subsequent iterations are skipped. This approach helps keep latency low for queries the model answers correctly on the initial pass.
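The loop can be sketched as follows. reflect and the prompt strings are illustrative, not the agent's actual code; the early-exit check mirrors the short-"CORRECT" behavior described above.

```python
def reflect(query, llm, max_iters=5):
    # Draft-critique-refine loop with early termination.
    draft = llm(f"Answer: {query}")
    for _ in range(max_iters):
        critique = llm(f"Critique this answer:\n{draft}")
        if "CORRECT" in critique and len(critique) < 20:
            break  # critique is satisfied; skip remaining iterations
        draft = llm(f"Revise using the critique:\n{critique}\nAnswer:\n{draft}")
    return draft
```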
Recommended usage: Creative writing, high-stakes technical explanations, anything where “good enough on the first try” is insufficient.
Adversarial Debate
Paper: Irving et al. (2018), “AI Safety via Debate”
Irving proposed debate as a mechanism for improving AI safety. Two agents present opposing arguments, and a judge (either a human or another LLM) evaluates their merits. The underlying premise is that identifying flaws in weak arguments is often easier than constructing strong ones.
DebateAgent conducts multiple rounds of PRO and CON arguments, with a judge evaluating each exchange. After all rounds, the strongest arguments from both sides are synthesized into a final answer that balances competing perspectives. Context is carried forward between rounds, enabling incremental refinement rather than redundant arguments.
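A compact sketch of the round structure (debate and its prompts are hypothetical; the transcript carried into each prompt is what lets later rounds build on earlier ones):

```python
def debate(topic, llm, rounds=2):
    # Alternate PRO and CON arguments, carrying the transcript forward,
    # then have a judge synthesize a balanced final answer.
    transcript = []
    for _ in range(rounds):
        pro = llm(f"Argue FOR: {topic}\nSo far:\n" + "\n".join(transcript))
        con = llm(f"Argue AGAINST: {topic}\nSo far:\n" + "\n".join(transcript))
        transcript += [f"PRO: {pro}", f"CON: {con}"]
    return llm("As judge, synthesize a balanced answer:\n" + "\n".join(transcript))
```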
Recommended usage: Controversial or ambiguous subjects; policy analysis; ethics and any subject matter requiring a balanced perspective.
Refinement Loop
Paper: Madaan et al. (2023), “Self-Refine: Iterative Refinement with Self-Feedback”
This paper describes a refinement loop similar to self-reflection, but instead of relying on a human-style critique to guide revisions, it uses a machine-based evaluation system with quantifiable quality metrics. These metrics determine whether further refinement is necessary. The loop terminates when a predefined quality metric is reached (> 0.9 by default) or when the maximum number of iterations is exceeded.
The complex refinement pipeline consists of five sequential stages: technical accuracy, structure, depth, examples, and polish. Each stage targets a distinct aspect of quality, so the model focuses on improving one dimension at a time rather than attempting to optimize everything at once.
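A minimal sketch of the staged loop is shown below. refine, STAGES, and score are illustrative names rather than the library's API; the 0.9 threshold matches the default described above, and score stands in for whatever quantitative evaluator drives the loop.

```python
STAGES = ["technical accuracy", "structure", "depth", "examples", "polish"]

def refine(text, llm, score, threshold=0.9, max_iters=5):
    # Loop until the quality metric clears the threshold or the
    # iteration budget is exhausted; each pass runs all five stages.
    for _ in range(max_iters):
        if score(text) > threshold:
            break
        for stage in STAGES:
            # One focused critique pass per quality dimension.
            text = llm(f"Improve only the {stage} of:\n{text}")
    return text
```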
Recommended usage: Highly technical writing; documentation; blog posts; any scenario where production-quality output is required rather than a first draft.
Family 4: Cross-Domain and Meta Strategies
Key insight: Cross-domain strategies enable sharing knowledge among disciplines, while meta-strategies automatically route queries to the most appropriate reasoning technique without requiring manual selection.
Analogy-Based Reasoning
Paper: Gentner (1983), “Structure Mapping: A Theoretical Framework for Analogy”, Cognitive Science
Gentner’s structure-mapping theory proposes that analogical reasoning operates by identifying structural correspondences across domains, rather than relying on surface-level similarity. The AnalogicalAgent builds on this idea through three phases: (1) identify the underlying structure independent of domain specifics, (2) generate analogous solutions from different domains that share that structure, (3) select the most effective analogy and apply its solution approach.
This process reduces reliance on memorized patterns. By focusing on underlying structure, the model learns why a solution works, rather than simply recalling what worked before.
Recommended usage: Solving problems that are structurally similar to prior ones, even if they differ superficially; transferring knowledge across domains; explaining complex concepts through analogy.
Socratic Questioning
Paper: Paul & Elder (2007), “The Art of Socratic Questioning”
The Socratic Method: Do not answer the question directly. Instead, ask follow-up questions that reduce ambiguity in the solution space.
SocraticAgent repeatedly asks questions and receives model responses, continuing until it reaches a limit of five question-response exchanges. It then synthesizes the collected information into a final answer. A deduplication or normalization step helps prevent repeated queries that differ only in wording.
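A rough sketch of that loop, including a naive normalization step for deduplication, follows (socratic is a hypothetical name; the real agent's normalization is likely more sophisticated than stripping punctuation and case):

```python
def socratic(query, llm, max_turns=5):
    # Ask clarifying questions, skip duplicates after normalization,
    # then synthesize a final answer from the collected Q/A pairs.
    seen, qa = set(), []
    for _ in range(max_turns):
        q = llm(f"Ask one clarifying question about: {query}")
        key = q.lower().strip(" ?")  # crude normalization for dedup
        if key in seen:
            continue  # same question in different wording; skip it
        seen.add(key)
        qa.append(f"Q: {q}\nA: {llm(q)}")
    return llm("Synthesize an answer from:\n" + "\n".join(qa))
```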
Recommended for: Philosophy; ethics; deep technical knowledge; any field requiring the model to “know” something as opposed to merely answering it.
ReAct (Reason + Act)
Paper: Yao et al. (2022), “ReAct: Synergizing Reasoning and Acting in Language Models”
ReAct is a framework that interleaves reasoning steps with tool invocations, allowing the model to ground its thinking in external information. In practice, the model decides what action to take, calls a tool such as a web search engine, examines the result, updates its reasoning, and repeats the cycle until it reaches a satisfactory answer. Current tools include web scraping, Wikipedia access via an API call, and a calculator interface, with mock-ups available for offline execution scenarios.
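The cycle can be sketched as a thought-action-observation loop. This is an illustrative reduction, not the agent's code: react, the FINISH convention, and the action:input format are assumptions, and tools maps an action name to any callable.

```python
def react(query, llm, tools, max_steps=5):
    # Thought -> action -> observation loop. The model either names a
    # tool ("action: input") or ends the loop with "FINISH <answer>".
    scratchpad = f"Question: {query}"
    for _ in range(max_steps):
        step = llm(scratchpad + "\nThought and Action (action: input) or FINISH:")
        if step.startswith("FINISH"):
            return step.removeprefix("FINISH").strip()
        action, _, arg = step.partition(":")
        obs = tools.get(action.strip(), lambda a: "unknown tool")(arg.strip())
        scratchpad += f"\n{step}\nObservation: {obs}"  # ground next thought
    return scratchpad  # budget exhausted; return the trace
```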
ReAct achieved 70.0% accuracy on ARC-Challenge (science reasoning). While not the highest score on this particular benchmark, it enabled tool use, allowing the model to search the Internet for the information it needed.
Recommended usage: Fact-checking; current events queries; mathematical calculations; tasks where access to grounded, external information is important.
Auto Router: MetaReasoningAgent
Key insight: A single LLM invocation allows MetaReasoningAgent to classify each input into one of eleven categories and route it to the most appropriate strategy, without human intervention.
Getting value from any of the sixteen strategies depends on picking the right one for a given task. MetaReasoningAgent removes that burden by selecting a strategy automatically.
The diagram below shows how each category maps to its corresponding strategy.

MetaReasoningAgent instantiates the selected strategy class and passes control to it, along with all event objects for visualization.
To use this capability, specify a model such as gemma3:270m+meta or gemma3:270m+auto.
In practice, routing is generally intuitive: math problems are directed to CoT, logic puzzles to ToT, philosophical questions to Socratic Questioning, and controversial topics to Adversarial Debate.
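A single-call router can be sketched like this. route and ROUTES are hypothetical, and only four of the eleven categories are shown; the fallback to CoT when the label is unrecognized is also an assumption, chosen because CoT is the strongest general-purpose strategy in the benchmarks.

```python
# Illustrative subset of the category-to-strategy map (4 of 11).
ROUTES = {
    "math": "cot",
    "logic_puzzle": "tot",
    "philosophy": "socratic",
    "controversial": "debate",
}

def route(query, llm):
    # One classification call, then dispatch; unknown labels fall
    # back to CoT as a safe general-purpose default.
    label = llm(
        "Classify into one of: " + ", ".join(ROUTES) + f"\nQuery: {query}"
    ).strip().lower()
    return ROUTES.get(label, "cot")
```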
The trade-off is reduced control over strategy-specific hyperparameters in exchange for automatic routing aligned with the problem type.
What Strategy Should You Pick? Benchmark Results (March 2026)
Key insight: CoT performs best on average (88.7%) across diverse tasks. ReAct excels when tool use is available (70.0% on ARC-Challenge). ToT and Self-Consistency tie on GSM8K math at 76.7%.
These results are based on 4,200 evaluations across 11 strategies using qwen3.5:9b, collected as of March 2026. All 16 strategies are implemented and production-ready. However, the benchmarks shown below focus on the 11 that produce a single extractable answer. The remaining five are generation-focused and not suited to multiple-choice evaluation.
The heat map and bar chart below provide a complete view of the results.

The short version: CoT wins on average across diverse tasks. Self-Consistency and ToT beat it on specific math benchmarks. ReAct dominates on factual/science tasks. Self-Reflection and Refinement Loop are not well captured by these benchmarks, as they primarily improve generation quality rather than multiple-choice accuracy.
For most queries, start with +cot. If you’re solving logic puzzles or planning problems, try +tot. If you need factually grounded responses, use +react. If you need polished, high-quality output rather than a quick answer, use +refinement. When in doubt, +meta will route the query automatically.
In my experience building agent-reasoning, the most surprising finding is how much prompt structure alone can improve performance. For example, qwen3.5:9b improves from 81.3% to 88.7% average accuracy simply by prompting it to produce numbered reasoning steps.
As of March 2026, all 16 strategies are production-ready and have been evaluated across 4,200 benchmark runs.
You can find the repository here. Install with pip install agent-reasoning or uv add agent-reasoning. The commands to get started:

The TUI provides a 16-agent sidebar, live streaming, and a step-through debugger. Arena mode runs all 16 agents simultaneously on the same query in a 4×4 grid.
If this is useful, a GitHub star is always appreciated.
Frequently Asked Questions
Do I need to modify my existing code to use agent-reasoning?
No. The interceptor is a drop-in replacement for the Ollama client. Just change the model name string by appending +strategy (e.g., gemma3:270m+cot) and the interceptor handles everything else. Existing LangChain pipelines, web UIs, and scripts work without any other changes.
Which strategy should I start with?
Start with +cot (Chain of Thought). It scored the highest average accuracy (88.7%) across our benchmarks and adds minimal latency. If you are unsure, use +meta and let the auto-router pick the best strategy for you.
Why were only 11 of the 16 strategies benchmarked?
The benchmarks (GSM8K, MMLU, ARC-Challenge) measure multiple-choice accuracy, which works well for strategies that produce a single extractable answer. The remaining five strategies are generation-focused (e.g., Refinement Loop, MCTS) and their strengths in output quality are not captured by multiple-choice evaluations. All 16 strategies are fully implemented and production-ready.
Can I use this with models other than Ollama-served models?
Currently the interceptor targets the Ollama API. Since it exposes the same .generate() and .chat() endpoints, any Ollama-compatible client works out of the box. Support for additional inference backends is on the roadmap.
How much slower are branching strategies compared to CoT?
ToT is roughly 5-8x slower than CoT because it generates and evaluates multiple candidate branches at each level. Self-Consistency (k=5 samples) adds similar overhead. For latency-sensitive applications, stick with sequential strategies (CoT, Least-to-Most) and reserve branching strategies for problems where accuracy matters more than speed.
I’m Nacho Martinez, Data Scientist at Oracle. I build open-source AI projects and write about making language models reason better. Find me on GitHub and LinkedIn, or visit the Oracle AI Developer page for more resources.
