Key Takeaways

  • Agent Reasoning is an open-source reasoning layer that adds planning, deduction, and self-correction to any Ollama-served LLM (e.g., gemma3, llama3) via a plug-and-play Python class or a proxy server.
  • Multiple proven reasoning strategies built-in (CoT, Self-Consistency, ToT, ReAct, Self-Reflection, Decomposition, Refinement) with a guided “start simple” path.
  • Practical tooling for teams: interactive CLI/TUI, Python API, and an Ollama-compatible gateway so existing apps gain reasoning without code changes.
  • Clear benchmark guidance: CoT delivers the best average accuracy; ToT shines for multi-step logic; ReAct leads when tools (search, calculator) matter.

Implementing Cognitive Problem-Solving in Open Source Models

From Nacho Martinez, Data Scientist Advocate at Oracle (and author of the A2A-based Multi-Agent RAG system), comes an open-source reasoning layer that enables any open-source Large Language Model (LLM), such as gemma3 or llama3, to perform complex planning, logical deduction, and self-correction. The layer wraps these models in a cognitive architecture based on key research papers (CoT, ToT, and ReAct).

We call this Agent Reasoning, and it is available open-source in this GitHub repository, alongside a Jupyter notebook.

Features of Agent Reasoning

  • Plug & Play: Use via Python Class or as a Network Proxy.
  • Model Agnostic: Works with any model served by Ollama.
  • Advanced Architectures:
    • Chain-of-Thought (CoT) & Self-Consistency: Implements Majority Voting (k samples) with temperature sampling.
    • Tree of Thoughts (ToT): BFS strategy with robust heuristic scoring and pruning.
    • ReAct (Reason + Act): Real-time tool usage (Web Search via scraping, Wikipedia API, Calculator) with fallback/mock capabilities. External grounding implemented.
    • Self-Reflection: Dynamic multi-turn Refinement Loop (Draft → Critique → Improve).
    • Decomposition & Least-to-Most: Planning and sub-task execution.
    • Refinement Loop: Score-based iterative improvement (Generator → Critic → Refiner) until quality threshold met.
    • Complex Refinement Pipeline: 5-stage optimization (Technical Accuracy → Structure → Depth → Examples → Polish).
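
The Self-Consistency idea above is easy to sketch in isolation: draw k temperature-sampled answers and keep the majority vote. The sketch below is generic (the deterministic mock stands in for real Ollama calls); it is not the package's implementation:

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_fn, query, k=5):
    # Draw k independent samples and majority-vote the final answers
    answers = [sample_fn(query) for _ in range(k)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / k  # answer plus its vote share

# Deterministic mock standing in for temperature-sampled LLM calls
_canned = cycle(["42", "42", "41", "42", "40"])
mock_llm = lambda query: next(_canned)

answer, share = self_consistency(mock_llm, "What is 6 * 7?", k=5)
print(answer, share)  # 42 0.6
```

Each extra sample is a full inference pass, which is where the added latency cost of this strategy comes from.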

Interactive Jupyter Notebook

We prepared an interactive Jupyter notebook to demonstrate the capabilities of Agent Reasoning.

This is a comprehensive demo covering all reasoning strategies (CoT, ToT, ReAct, Self-Reflection) with benchmarks and comparisons.

Architectures in Detail

For most users, start with Chain-of-Thought (CoT) — it has the best average accuracy and lowest latency cost. Use Self-Consistency when correctness is critical and you can afford 3–5× more inference time. Avoid ToT for knowledge-retrieval tasks (it underperforms baseline on MMLU) and reserve it for multi-step planning or logic puzzles.
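
At its simplest, Chain-of-Thought is prompt injection. The minimal sketch below is illustrative (the trigger phrase and mock model are assumptions, not the package's internals):

```python
def inject_cot(query):
    # Zero-shot CoT trigger, in the spirit of "let's think step by step"
    return f"{query}\n\nLet's think step by step, then state the final answer."

def ask(model_fn, query, use_cot=True):
    # Inject the reasoning instruction before delegating to the model
    prompt = inject_cot(query) if use_cot else query
    return model_fn(prompt)

# Mock model that simply reports whether it saw the CoT trigger
mock_llm = lambda p: "reasoned" if "step by step" in p else "direct"

with_cot = ask(mock_llm, "What is 17 * 24?")
without_cot = ask(mock_llm, "What is 17 * 24?", use_cot=False)
print(with_cot, without_cot)  # reasoned direct
```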

| Architecture | Description | Best For | Papers |
| --- | --- | --- | --- |
| Chain-of-Thought | Step-by-step reasoning prompt injection. | Math, Logic, Explanations | Wei et al. (2022) |
| Self-Reflection | Draft → Critique → Refine loop. | Creative Writing, High Accuracy | Shinn et al. (2023) |
| ReAct | Interleaves reasoning and tool usage. | Fact-checking, Calculations | Yao et al. (2022) |
| Tree of Thoughts | Explores multiple reasoning branches (BFS/DFS). | Complex Riddles, Strategy | Yao et al. (2023) |
| Decomposed | Breaks complex queries into sub-tasks. | Planning, Long-form answers | Khot et al. (2022) |
| Recursive (RLM) | Uses a Python REPL to recursively process prompt variables. | Long-context processing | Author et al. (2025) |
| Refinement Loop | Generator → Critic (0.0–1.0 score) → Refiner iterative loop. | Technical Writing, Quality Content | Inspired by Madaan et al. (2023) |
| Complex Refinement | 5-stage pipeline: Accuracy → Clarity → Depth → Examples → Polish. | Long-form Articles, Documentation | Multi-stage refinement architecture |
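
The ReAct strategy above can be illustrated with a generic Thought/Action/Observation loop. The output format, the calc tool, and the scripted model below are illustrative assumptions, not the package's actual protocol:

```python
import re

def react_loop(model_fn, tools, query, max_steps=5):
    # Generic ReAct loop: the model emits "Action: tool[input]" lines until
    # it produces "Final: answer"; tool observations are fed back in.
    transcript = f"Question: {query}\n"
    for _ in range(max_steps):
        step = model_fn(transcript)
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step[len("Final:"):].strip()
        match = re.match(r"Action: (\w+)\[(.*)\]", step)
        if match:
            name, arg = match.groups()
            transcript += f"Observation: {tools[name](arg)}\n"
    return None  # gave up within the step budget

# Mock pieces: a calculator tool and a scripted "model"
tools = {"calc": lambda expr: str(eval(expr, {"__builtins__": {}}))}
script = iter(["Action: calc[17 * 24]", "Final: 408"])
mock_model = lambda transcript: next(script)

result = react_loop(mock_model, tools, "What is 17 * 24?")
print(result)  # 408
```

The real implementation swaps the mocks for an LLM call and the tools listed earlier (web search, Wikipedia, calculator).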

Accuracy Benchmarks

You can evaluate reasoning strategies against standard NLP datasets to measure accuracy improvements. The benchmark system includes embedded question sets from 4 standard datasets.

To run an accuracy benchmark from the CLI:

(Screenshot: running the accuracy benchmark to evaluate reasoning strategies)

Or using the Python API:

(Code sample: running the accuracy benchmark from Python; see the repository)

Charts are auto-generated after each run and saved to benchmarks/charts/.

| Dataset | Category | Questions | Format | Reference |
| --- | --- | --- | --- | --- |
| GSM8K | Math Reasoning | 30 | Open-ended number | Cobbe et al. (2021) |
| MMLU | Knowledge (57 subjects) | 30 | Multiple choice (A-D) | Hendrycks et al. (2021) |
| ARC-Challenge | Science Reasoning | 25 | Multiple choice (A-D) | Clark et al. (2018) |
| HellaSwag | Commonsense | 20 | Multiple choice (A-D) | Zellers et al. (2019) |

The following are the results of a full evaluation across all 11 strategies:

| Strategy | GSM8K | MMLU | ARC-C | HellaSwag | Avg |
| --- | --- | --- | --- | --- | --- |
| Standard (baseline) | 66.7% | 90.0% | 92.0% | 90.0% | 84.7% |
| Chain of Thought | 73.3% | 96.7% | 88.0% | 90.0% | 87.0% |
| Tree of Thoughts | 76.7% | 63.3% | 76.0% | 90.0% | 76.5% |
| ReAct | 63.3% | 86.7% | 96.0% | 90.0% | 84.0% |
| Self-Reflection | 66.7% | 90.0% | 88.0% | 90.0% | 83.7% |
| Self-Consistency | 76.7% | 96.7% | 92.0% | — | 66.3% |
| Decomposed | 10.0% | 60.0% | 84.0% | — | 38.5% |

Key findings:

  • CoT achieves the highest average accuracy (87.0%), outperforming Standard on GSM8K (+6.6%) and MMLU (+6.7%)
  • Self-Consistency ties CoT on MMLU (96.7%) and GSM8K (76.7%) through majority voting
  • ToT excels on GSM8K math (76.7%, +10% over Standard) through branch exploration
  • ReAct achieves the highest ARC-Challenge score (96.0%) via tool-augmented reasoning
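
The branch exploration credited to ToT above can be sketched as a breadth-first search over partial "thoughts". The toy expand/score functions below are stand-ins for LLM calls and the heuristic scorer, not the package's implementation:

```python
def tree_of_thoughts_bfs(expand, score, root, width=2, depth=3):
    # Generic BFS: expand each frontier node into candidate thoughts,
    # score them heuristically, and keep only the best `width` per level.
    frontier = [root]
    for _ in range(depth):
        candidates = [c for node in frontier for c in expand(node)]
        if not candidates:
            break
        frontier = sorted(candidates, key=score, reverse=True)[:width]
    return max(frontier, key=score)

# Toy problem: build the largest number by appending digits one at a time
expand = lambda s: [s + d for d in "123"]
score = lambda s: int(s) if s else 0
best = tree_of_thoughts_bfs(expand, score, "", width=2, depth=3)
print(best)  # 333
```

Pruning to `width` candidates per level is what keeps the call count tractable compared to exhaustive search.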

Accuracy statistics

This is the per-strategy accuracy heat map:

(Chart: accuracy heat map per strategy)

This is the average accuracy by strategy:

(Chart: average accuracy by strategy across 4 datasets for gemma3:latest)

Benchmarks

Benchmark charts are auto-generated after every benchmark run.

For a complete listing of sample output benchmarks (response latency, throughput etc.) please refer to the Agent Reasoning GitHub repository.

Quick start (3 commands)

uv sync && ollama pull gemma3:270m && uv run agent-reasoning

Installation

One-command, single-step install

curl -fsSL https://raw.githubusercontent.com/jasperan/agent-reasoning/main/install.sh | bash

You can also install agent-reasoning using either PyPI or directly from source.

Using PyPI

From Source using uv

Development

Configuring the large language model (LLM)

We use Ollama as an example for this procedure.

Ollama must be running locally, or you can connect to a remote Ollama instance.

ollama pull gemma3:270m    # Tiny model for quick testing
ollama pull gemma3:latest  # Full model for quality results

Configuring the remote Ollama endpoint

If you don’t have Ollama installed locally, you can connect to a remote Ollama instance. Configuration is stored in config.yaml in the root directory of the repository.

Option 1: Interactive CLI configuration

agent-reasoning
# Select "Configure Endpoint" from the menu

Option 2: Server CLI Argument

agent-reasoning-server --ollama-host http://192.168.1.100:11434

Option 3: Direct Config File

Copy the example config and edit it:

cp config.yaml.example config.yaml

Or create config.yaml in the project root:

ollama:
  host: http://192.168.1.100:11434

Option 4: Python API

Usage

1. Interactive CLI

Use the rich CLI to access all agents, comparisons and benchmarks.

  • Timing Metrics: Every response shows TTFT, total time, tokens/sec
  • Session History: All chats auto-saved to data/sessions/ with export to markdown
  • Head-to-Head: Compare any two strategies side-by-side in parallel
  • Agent Info: Built-in strategy guide with descriptions and use cases
  • Benchmark Charts: Auto-generate PNG visualizations of benchmark results

Setup

Shortcuts

The CLI also provides useful shortcuts:

Interactive experience

2. Terminal UI

You can also use a Go-based terminal interface with a split-panel layout and arena grid view.

  • Split layout: agent sidebar + chat panel
  • Arena mode: 4×4 grid showing all agents running in parallel
  • Real-time streaming with cancellation support

The TUI automatically starts the reasoning server on launch. Requires Go 1.18+.

Keybindings for TUI

Chat View

The default chat view is a split-pane layout with a 16-agent sidebar, chat panel with live streaming, and a metrics bar showing TTFT, tokens/sec, and token count in real-time.

Press v to toggle structured visualization mode. Instead of raw text, you see the agent’s reasoning process rendered live: tree diagrams for ToT, swimlanes for ReAct, vote tallies for Consistency, score gauges for Refinement, and more.

Press p to open the hyperparameter tuner. Adjust ToT width/depth, Consistency samples, Refinement score thresholds, and other agent parameters before running a query.

Press ? to invoke the strategy advisor. The MetaReasoningAgent analyzes your query and recommends the best strategy.

Modes of interaction

Arena Mode races all 16 agents simultaneously on the same query, displayed in a 4×4 grid; a leaderboard bar updates as each agent finishes:

Head-to-Head Duel pits two agents against each other on the same query.

There are plenty of other features to try, such as:

  • the Step-Through Debugger which enables pausing the agent between LLM calls and inspecting intermediate state
  • the Benchmark Dashboard which reads existing JSON benchmark files
  • the Session Browser which enables search and re-running of past conversations, with filtering options
  • the Agent Guide, which contains reference cards for all 16 agents, covering best-for use cases, parameters, trade-offs, and research references. Pressing Enter on any card starts a chat with that agent.

3. Python API (for developers)

Use the ReasoningInterceptor as a drop-in replacement for your LLM client.
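
The exact ReasoningInterceptor API lives in the repository; as a generic illustration of the drop-in idea, a wrapper can intercept prompts and inject a strategy before delegating to the underlying client. All names below are hypothetical:

```python
class ReasoningWrapper:
    # Illustrative stand-in for the interceptor idea: wrap any completion
    # callable and inject a reasoning strategy transparently.
    PREFIXES = {
        "cot": "Think step by step before answering.\n\n",
        "none": "",
    }

    def __init__(self, llm_fn, strategy="cot"):
        self.llm_fn = llm_fn
        self.prefix = self.PREFIXES[strategy]

    def __call__(self, prompt):
        # Callers see a plain completion function; reasoning is injected here
        return self.llm_fn(self.prefix + prompt)

# Echo mock in place of a real Ollama client call
mock_llm = lambda prompt: prompt
client = ReasoningWrapper(mock_llm, strategy="cot")
out = client("What is 2 + 2?")
```

The point of the pattern is that existing call sites keep their signature unchanged; only the client construction differs.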

Using agents directly:

Using refinement agents for quality control:
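
The refinement agents' exact API is documented in the repository; the generic loop they implement (Generator → Critic → Refiner until a score threshold is met) can be sketched with mock agents:

```python
def refinement_loop(generate, critique, refine, query,
                    threshold=0.8, max_iters=3):
    # Generic Generator -> Critic -> Refiner loop: iterate until the
    # critic's 0.0-1.0 score clears the threshold or the budget runs out.
    draft = generate(query)
    score = 0.0
    for _ in range(max_iters):
        score, feedback = critique(query, draft)
        if score >= threshold:
            break
        draft = refine(query, draft, feedback)
    return draft, score

# Mock agents: each refine pass raises quality by a fixed amount
generate = lambda q: "draft-0"
critique = lambda q, d: (0.5 + 0.2 * int(d.split("-")[1]), "add detail")
refine = lambda q, d, fb: f"draft-{int(d.split('-')[1]) + 1}"

text, score = refinement_loop(generate, critique, refine, "Explain DNS")
print(text, score)  # draft-2 ~0.9
```

In practice the three roles are separate LLM calls, so each iteration costs roughly three inference passes.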

4. Reasoning Gateway Server

Run a proxy server that impersonates Ollama. This allows any Ollama-compatible app, such as LangChain or Web UIs, to gain reasoning capabilities without any code changes whatsoever.

Then configure your app:

  • Base URL: http://localhost:8080
  • Model: gemma3:270m+cot (or +tot, +react, etc.)

API Endpoints

Troubleshooting

  • Model Not Found: Ensure you have pulled the base model (ollama pull gemma3:270m).
  • Timeout / Slow: ToT and Self-Reflection make multiple calls to the LLM. With larger models (Llama3 70b), this can take time.
  • Hallucinations: The default demo uses gemma3:270m which is extremely small and prone to logic errors. Switch to gemma2:9b or llama3 for robust results.

Extending the system further

You can add additional reasoning strategies.

  1. Create a class in src/agent_reasoning/agents/ inheriting from BaseAgent.
  2. Implement the stream(self, query) method.
  3. Register it in AGENT_MAP in src/agent_reasoning/interceptor.py.
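
As an illustration of steps 1–3 (the stub BaseAgent below only mirrors the described interface; the real one lives in src/agent_reasoning/agents/, and the example agent is hypothetical):

```python
# Stub mirroring the described BaseAgent shape -- NOT the real base class
class BaseAgent:
    def __init__(self, llm_fn):
        self.llm_fn = llm_fn

    def stream(self, query):
        raise NotImplementedError

class DevilsAdvocateAgent(BaseAgent):
    """Example strategy: answer, then argue against the answer."""
    def stream(self, query):
        answer = self.llm_fn(query)
        yield answer
        yield self.llm_fn(f"List weaknesses in this answer: {answer}")

# Step 3: registration maps a strategy name to the class
AGENT_MAP = {"devils-advocate": DevilsAdvocateAgent}

mock_llm = lambda p: f"resp({len(p)})"
agent = AGENT_MAP["devils-advocate"](mock_llm)
chunks = list(agent.stream("Is P=NP?"))
```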

Conclusion

Thank you for reading, and we look forward to seeing what you build using Agent Reasoning!

Frequently Asked Questions (FAQs)

When should I use each strategy?

Start with Chain-of-Thought for best accuracy/latency trade-off; use Self-Consistency when correctness is critical; reserve Tree of Thoughts for complex multi-step reasoning; pick ReAct for fact-checks or calculations.

Do I need a specific model?

No. It’s model-agnostic for any model served by Ollama. Quality improves with larger models (e.g., gemma2:9b, llama3 vs tiny 270m).

How hard is setup?

Three-command quick start, one-line install script, and ready-to-run demos in a Jupyter notebook. A proxy lets existing Ollama apps adopt reasoning by just changing the base URL/model name.

How do I evaluate results?

Built-in benchmarks (GSM8K, MMLU, ARC-Challenge, HellaSwag) auto-generate charts, with side-by-side strategy comparisons and session histories for review.