Key Takeaways
- Agent Reasoning is an open-source reasoning layer that adds planning, deduction, and self-correction to any Ollama-served LLM (e.g., gemma3, llama3), via plug-and-play Python or a proxy server.
- Multiple proven reasoning strategies built-in (CoT, Self-Consistency, ToT, ReAct, Self-Reflection, Decomposition, Refinement) with a guided “start simple” path.
- Practical tooling for teams: interactive CLI/TUI, Python API, and an Ollama-compatible gateway so existing apps gain reasoning without code changes.
- Clear benchmark guidance: CoT delivers the best average accuracy; ToT shines for multi-step logic; ReAct leads when tools (search, calculator) matter.
Implementing Cognitive Problem-Solving in Open Source Models
From Nacho Martinez, Data Scientist Advocate at Oracle (and author of the A2A-based Multi-Agent RAG system), comes an open-source reasoning layer that enables any open-source Large Language Model (LLM), such as gemma3 or llama3, to perform complex planning, logical deduction, and self-correction. The layer wraps these models in a cognitive architecture grounded in key research papers (CoT, ToT, and ReAct).
We call this Agent Reasoning, and it is available open-source in this GitHub repository, alongside a Jupyter notebook.
Features of Agent Reasoning
- Plug & Play: Use via Python Class or as a Network Proxy.
- Model Agnostic: Works with any model served by Ollama.
- Advanced Architectures:
- Chain-of-Thought (CoT) & Self-Consistency: Implements Majority Voting (k samples) with temperature sampling.
- Tree of Thoughts (ToT): BFS strategy with robust heuristic scoring and pruning.
- ReAct (Reason + Act): Real-time tool usage (Web Search via scraping, Wikipedia API, Calculator) with fallback/mock capabilities. External grounding implemented.
- Self-Reflection: Dynamic multi-turn Refinement Loop (Draft → Critique → Improve).
- Decomposition & Least-to-Most: Planning and sub-task execution.
- Refinement Loop: Score-based iterative improvement (Generator → Critic → Refiner) until quality threshold met.
- Complex Refinement Pipeline: 5-stage optimization (Technical Accuracy → Structure → Depth → Examples → Polish).
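To make the Self-Consistency bullet concrete, here is a minimal, self-contained sketch of majority voting over k sampled answers. A stand-in sampler replaces the actual Ollama call; this is an illustration, not the library's implementation.

```python
import random
from collections import Counter

def self_consistency(sample_answer, k=5):
    """Majority voting: draw k sampled answers and return the most common one.

    `sample_answer` stands in for a single temperature-sampled LLM call;
    the real library wires this to an Ollama-served model.
    """
    votes = Counter(sample_answer() for _ in range(k))
    answer, _count = votes.most_common(1)[0]
    return answer

# Stand-in "model": a noisy sampler that usually returns the right answer.
rng = random.Random(0)

def noisy_model():
    return rng.choice(["42", "42", "42", "41"])

print(self_consistency(noisy_model, k=5))
```

Because occasional wrong samples are outvoted, accuracy improves at the cost of k times the inference time, which is why the benchmark guidance reserves it for correctness-critical queries.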
Interactive Jupyter Notebook
We prepared an interactive Jupyter notebook to demonstrate the capabilities of agent reasoning.
This is a comprehensive demo covering all reasoning strategies (CoT, ToT, ReAct, Self-Reflection) with benchmarks and comparisons.
Architectures in Detail
For most users, start with Chain-of-Thought (CoT) — it has the best average accuracy and lowest latency cost. Use Self-Consistency when correctness is critical and you can afford 3–5× more inference time. Avoid ToT for knowledge-retrieval tasks (it underperforms baseline on MMLU) and reserve it for multi-step planning or logic puzzles.
| Architecture | Description | Best For | Papers |
|---|---|---|---|
| Chain-of-Thought | Step-by-step reasoning prompt injection. | Math, Logic, Explanations | Wei et al. (2022) |
| Self-Reflection | Draft → Critique → Refine loop. | Creative Writing, High Accuracy | Shinn et al. (2023) |
| ReAct | Interleaves Reasoning and Tool Usage. | Fact-checking, Calculations | Yao et al. (2022) |
| Tree of Thoughts | Explores multiple reasoning branches (BFS/DFS). | Complex Riddles, Strategy | Yao et al. (2023) |
| Decomposed | Breaks complex queries into sub-tasks. | Planning, Long-form answers | Khot et al. (2022) |
| Recursive (RLM) | Uses a Python REPL to recursively process prompt variables. | Long-context processing | Author et al. (2025) |
| Refinement Loop | Generator → Critic (0.0–1.0 score) → Refiner iterative loop. | Technical Writing, Quality Content | Inspired by Madaan et al. (2023) |
| Complex Refinement | 5-stage pipeline: Accuracy → Clarity → Depth → Examples → Polish. | Long-form Articles, Documentation | Multi-stage refinement architecture |
Accuracy Benchmarks
You can evaluate reasoning strategies against standard NLP datasets to measure accuracy improvements. The benchmark system includes embedded question sets from 4 standard datasets.

To run an accuracy benchmark:
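The exact command was omitted here; a plausible invocation, assuming a `benchmark` subcommand with dataset and strategy flags (check the repository for the real flags), would be:

```shell
# Hypothetical invocation -- consult the repository for the exact subcommand and flags.
uv run agent-reasoning benchmark --dataset gsm8k --strategy cot
```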

Or using the Python API:
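The snippet was omitted here; a sketch of what the Python entry point might look like (module and function names are assumptions, not the confirmed API):

```python
# Hypothetical sketch -- module and function names are assumptions.
from agent_reasoning.benchmarks import run_benchmark

results = run_benchmark(datasets=["gsm8k", "mmlu"], strategies=["cot", "tot"])
print(results)  # charts are written to benchmarks/charts/ after each run
```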

Charts are auto-generated after each run and saved to benchmarks/charts/.
| Dataset | Category | Questions | Format | Reference |
|---|---|---|---|---|
| GSM8K | Math Reasoning | 30 | Open-ended number | Cobbe et al. (2021) |
| MMLU | Knowledge (57 subjects) | 30 | Multiple choice (A-D) | Hendrycks et al. (2021) |
| ARC-Challenge | Science Reasoning | 25 | Multiple choice (A-D) | Clark et al. (2018) |
| HellaSwag | Commonsense | 20 | Multiple choice (A-D) | Zellers et al. (2019) |
The following are the results of a full evaluation across all 11 strategies:
| Strategy | GSM8K | MMLU | ARC-C | HellaSwag | Avg |
|---|---|---|---|---|---|
| Standard (baseline) | 66.7% | 90.0% | 92.0% | 90.0% | 84.7% |
| Chain of Thought | 73.3% | 96.7% | 88.0% | 90.0% | 87.0% |
| Tree of Thoughts | 76.7% | 63.3% | 76.0% | 90.0% | 76.5% |
| ReAct | 63.3% | 86.7% | 96.0% | 90.0% | 84.0% |
| Self-Reflection | 66.7% | 90.0% | 88.0% | 90.0% | 83.7% |
| Self-Consistency | 76.7% | 96.7% | 92.0% | — | 66.3% |
| Decomposed | 10.0% | 60.0% | 84.0% | — | 38.5% |
Key findings:
- CoT achieves the highest average accuracy (87.0%), outperforming Standard on GSM8K (+6.6%) and MMLU (+6.7%)
- Self-Consistency matches CoT on MMLU (96.7%) and ties ToT for the top GSM8K score (76.7%) through majority voting
- ToT excels on GSM8K math (76.7%, +10% over Standard) through branch exploration
- ReAct achieves the highest ARC-Challenge score (96.0%) via tool-augmented reasoning
Accuracy statistics
This is the accuracy heat map per-strategy:

This is the average accuracy by strategy:

Benchmarks
Benchmark charts are auto-generated after every benchmark run.
For a complete listing of sample output benchmarks (response latency, throughput etc.) please refer to the Agent Reasoning GitHub repository.
Quick start (3 commands)
uv sync && ollama pull gemma3:270m && uv run agent-reasoning
Installation
One-command, single-step install
curl -fsSL https://raw.githubusercontent.com/jasperan/agent-reasoning/main/install.sh | bash
You can also install agent-reasoning from either PyPI or source.
Using PyPI
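Assuming the package is published under the same name as the repository:

```shell
pip install agent-reasoning
```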

From Source using uv
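The repository path below is taken from the install-script URL above:

```shell
git clone https://github.com/jasperan/agent-reasoning.git
cd agent-reasoning
uv sync
```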

Development
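The development commands were omitted here; a plausible setup, assuming a standard uv project layout with a test suite (the extras layout is an assumption):

```shell
# Dev setup sketch -- the extras/test layout is an assumption.
uv sync --all-extras
uv run pytest
```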

Configuring the large language model (LLM)
We use Ollama as an example for this procedure.
Ollama must be running locally, or you can connect to a remote Ollama instance.
ollama pull gemma3:270m # Tiny model for quick testing
ollama pull gemma3:latest # Full model for quality results
Configuring the remote Ollama endpoint
If you don’t have Ollama installed locally, you can connect to a remote Ollama instance. Configuration is stored in config.yaml in the root directory of the repository.
Option 1: Interactive CLI configuration
agent-reasoning
# Select "Configure Endpoint" from the menu
Option 2: Server CLI Argument
agent-reasoning-server --ollama-host http://192.168.1.100:11434
Option 3: Direct Config File
Copy the example config and edit it:
cp config.yaml.example config.yaml
Or create config.yaml in the project root:
ollama:
  host: http://192.168.1.100:11434
Option 4: Python API
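The snippet was omitted here; a sketch, assuming the interceptor accepts an `ollama_host` keyword mirroring the server's `--ollama-host` flag (the keyword name is an assumption):

```python
# Hypothetical sketch -- only the ReasoningInterceptor class name comes from
# the docs; the ollama_host keyword mirrors the server's --ollama-host flag
# and is an assumption.
from agent_reasoning import ReasoningInterceptor

client = ReasoningInterceptor(ollama_host="http://192.168.1.100:11434")
```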

Usage
1. Interactive CLI
Use the rich CLI to access all agents, comparisons and benchmarks.
- Timing Metrics: Every response shows TTFT (time to first token), total time, and tokens/sec
- Session History: All chats auto-saved to data/sessions/ with export to markdown
- Head-to-Head: Compare any two strategies side-by-side in parallel
- Agent Info: Built-in strategy guide with descriptions and use cases
- Benchmark Charts: Auto-generate PNG visualizations of benchmark results
Setup
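To launch the interactive CLI, run it from the repository root (the same command as the quick start):

```shell
uv run agent-reasoning
```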

Shortcuts
The CLI also provides useful shortcuts:

Interactive experience

2. Terminal UI
You can also use a Go-based terminal interface with a split-panel layout and arena grid view.
- Split layout: agent sidebar + chat panel
- Arena mode: 4×4 grid showing all 16 agents running in parallel
- Real-time streaming with cancellation support

The TUI automatically starts the reasoning server on launch. Requires Go 1.18+.
Keybindings for TUI

Chat View
The default chat view is a split-pane layout with a 16-agent sidebar, chat panel with live streaming, and a metrics bar showing TTFT, tokens/sec, and token count in real-time.

Press v to toggle structured visualization mode. Instead of raw text, you see the agent’s reasoning process rendered live: tree diagrams for ToT, swimlanes for ReAct, vote tallies for Consistency, score gauges for Refinement, and more.
Press p to open the hyperparameter tuner. Adjust ToT width/depth, Consistency samples, Refinement score thresholds, and other agent parameters before running a query.
Press ? to invoke the strategy advisor. The MetaReasoningAgent analyzes your query and recommends the best strategy.
Modes of interaction
Arena Mode races all 16 agents simultaneously on the same query, displayed in a 4×4 grid; a leaderboard bar updates as each agent finishes:

Head-to-Head Duel pits two agents against each other one-on-one on the same query.

There are plenty of other features to try, such as:
- the Step-Through Debugger which enables pausing the agent between LLM calls and inspecting intermediate state
- the Benchmark Dashboard which reads existing JSON benchmark files
- the Session Browser which enables search and re-running of past conversations, with filtering options
- the Agent Guide, which contains reference cards for all 16 agents, covering best-for, parameters, trade-offs, and research reference. Pressing Enter on any card initiates a chat with the agent.
3. Python API (for developers)
Use the ReasoningInterceptor as a drop-in replacement for your LLM client.
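The original snippet was omitted here; a minimal usage sketch, assuming the interceptor takes model/strategy arguments and exposes a chat-style call (both are assumptions, not the confirmed signature):

```python
# Hypothetical sketch -- only the ReasoningInterceptor class name comes from
# the docs; the constructor arguments and chat() method are assumptions.
from agent_reasoning import ReasoningInterceptor

client = ReasoningInterceptor(model="gemma3:270m", strategy="cot")
print(client.chat("What is 17 * 24?"))
```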

Using agents directly:
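A sketch of direct agent use; the class name is an assumption modeled on the strategy names (the real classes live in src/agent_reasoning/agents/), while `stream(query)` matches the agent interface described in the extension steps later in this document:

```python
# Hypothetical sketch -- the agent class name is an assumption; agents
# expose a stream(query) generator per this project's agent interface.
from agent_reasoning.agents import TreeOfThoughtsAgent

agent = TreeOfThoughtsAgent(model="gemma3:270m")
for chunk in agent.stream("A farmer has 17 sheep; all but 9 run away. How many are left?"):
    print(chunk, end="")
```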

Using refinement agents for quality control:
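A sketch for the refinement agents; class and parameter names are assumptions based on the Refinement Loop description (Generator → Critic → Refiner with a 0.0–1.0 quality score threshold):

```python
# Hypothetical sketch -- class and parameter names are assumptions based on
# the Refinement Loop description.
from agent_reasoning.agents import RefinementLoopAgent

agent = RefinementLoopAgent(model="gemma3:latest", score_threshold=0.8)
for chunk in agent.stream("Write a short guide to Python context managers."):
    print(chunk, end="")
```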

4. Reasoning Gateway Server
Run a proxy server that impersonates Ollama. This allows any Ollama-compatible app, such as LangChain or Web UIs, to gain reasoning capabilities without any code changes whatsoever.
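For example, pointing the gateway at a local Ollama instance using the `--ollama-host` flag shown in the configuration options above:

```shell
# Starts the Ollama-impersonating gateway; per the docs it serves
# an Ollama-compatible API on http://localhost:8080.
agent-reasoning-server --ollama-host http://localhost:11434
```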

Then configure your app:
- Base URL: http://localhost:8080
- Model: gemma3:270m+cot (or +tot, +react, etc.)
API Endpoints
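The endpoint list was omitted here. Since the gateway impersonates Ollama, the standard Ollama API routes should apply; a hedged example against the default port above:

```shell
# Assumed Ollama-compatible endpoints: /api/generate, /api/chat, /api/tags.
curl http://localhost:8080/api/generate -d '{
  "model": "gemma3:270m+cot",
  "prompt": "Why is the sky blue?"
}'
```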

Troubleshooting
- Model Not Found: Ensure you have pulled the base model (ollama pull gemma3:270m).
- Timeout / Slow: ToT and Self-Reflection make multiple calls to the LLM. With larger models (e.g., llama3:70b), this can take time.
- Hallucinations: The default demo uses gemma3:270m which is extremely small and prone to logic errors. Switch to gemma2:9b or llama3 for robust results.
Extending the system further
You can add additional reasoning strategies.
- Create a class in src/agent_reasoning/agents/ inheriting from BaseAgent.
- Implement the stream(self, query) method.
- Register it in AGENT_MAP in src/agent_reasoning/interceptor.py.
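The steps above can be sketched as follows. The stub `BaseAgent` here only stands in for the real one in src/agent_reasoning/agents/ so the example runs on its own, and the echo logic is purely illustrative:

```python
# Stub standing in for agent_reasoning's BaseAgent so this sketch is
# self-contained; in the real codebase, subclass the actual BaseAgent.
class BaseAgent:
    def stream(self, query):
        raise NotImplementedError

class EchoAgent(BaseAgent):
    """Toy strategy: restate the query, then 'answer' it step by step."""

    def stream(self, query):
        yield f"Restating the problem: {query}\n"
        yield "Step 1: ... Step 2: ... Final answer: ...\n"

agent = EchoAgent()
for chunk in agent.stream("What is 2 + 2?"):
    print(chunk, end="")
```

After registering the class under a short name in AGENT_MAP, it becomes selectable like any built-in strategy.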

Conclusion
Thank you for reading, and we look forward to seeing what you build using Agent Reasoning!
Frequently Asked Questions (FAQs)
When should I use each strategy?
Start with Chain-of-Thought for best accuracy/latency trade-off; use Self-Consistency when correctness is critical; reserve Tree of Thoughts for complex multi-step reasoning; pick ReAct for fact-checks or calculations.
Do I need a specific model?
No. It’s model-agnostic for any model served by Ollama. Quality improves with larger models (e.g., gemma2:9b, llama3 vs tiny 270m).
How hard is setup?
Three-command quick start, one-line install script, and ready-to-run demos in a Jupyter notebook. A proxy lets existing Ollama apps adopt reasoning by just changing the base URL/model name.
How do I evaluate results?
Built-in benchmarks (GSM8K, MMLU, ARC-Challenge, HellaSwag) auto-generate charts, with side-by-side strategy comparisons and session histories for review.
