Here you will find a collection of material (books, papers, blog posts, etc.) related to reasoning and cognition in AI systems. Specifically, we want to cover agents, cognitive architectures, general problem-solving strategies, and self-improvement.
The term "System 2" in the page title refers to the slower, more deliberative, and more logical mode of thought as described by Daniel Kahneman in his book Thinking, Fast and Slow.
Do you know a great resource we should add? Please see How to contribute.
(looking for additional links & articles and summaries)
- SOAR (State, Operator, And Result) by John Laird, Allen Newell, and Paul Rosenbloom
- ACT-R (Adaptive Control of Thought-Rational) by John Anderson at CMU
- SPAUN (Semantic Pointer Architecture Unified Network) by Chris Eliasmith at Waterloo, SPAUN 2.0 by Feng-Xuan Choo
- ART (Adaptive resonance theory) by Stephen Grossberg and Gail Carpenter
- CLARION (Connectionist Learning with Adaptive Rule Induction ON-line) by Ron Sun
- EPIC (Executive Process/Interactive Control) by David Kieras and David Meyer
- LIDA (Learning Intelligent Distribution Agent) by Stan Franklin, 2015 Paper
- Sigma by Paul Rosenbloom
- OpenCog by Ben Goertzel
- NARS (Non-Axiomatic Reasoning System) by Pei Wang
- Icarus by Pat Langley
- MicroPsi by Joscha Bach
- Thousand Brains Theory & HTM (Hierarchical Temporal Memory) by Jeff Hawkins
- SPH (Sparse Predictive Hierarchies) by Eric Laukien
- Leabra (Local, Error-driven and Associative, Biologically Realistic Algorithm), 2016 Paper by Randall O'Reilly
- CogNGen (COGnitive Neural GENerative system) by Alexander Ororbia and Mary Alexandria Kelly, see also here and here
- KIX (KIX: A Metacognitive Generalization Framework) by A. Kumar and Paul Schrater
- ACE (Autonomous Cognitive Entity) by David Shapiro et al., gh: daveshap/ACE_Framework
- Iterative Updating of Working Memory by Jared Reser, website, Video
- Nov 2024 LLaVA-o1: Let Vision Language Models Reason Step-by-Step
- ReAct: Synergizing Reasoning and Acting in Language Models
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, gh: SakanaAI/AI-Scientist
- OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
- TextGrad: Automatic "Differentiation" via Text
- Trace is the New AutoDiff -- Unlocking Efficient Optimization of Computational Workflows
- Agentless: Demystifying LLM-based Software Engineering Agents
- Competition-Level Code Generation with AlphaCode
- AI Agents That Matter
- Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning
- Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
- Self-Rewarding Language Models
- ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models
- MedAgent-Zero: Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents
- Cognitive Architectures for Language Agents
- Large Language Models Can Self-Improve At Web Agent Tasks
- AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation
- A Prefrontal Cortex-inspired Architecture for Planning in Large Language Models
- CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
- DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
- GPT-Swarm: Language Agents as Optimizable Graphs
- Survey: Reasoning with Large Language Models, a Survey (Jul 2024)
- Survey: From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future (Aug 2024)
- ADAS: Automated Design of Agentic Systems
- IDEA: Enhancing the Rule Learning Ability of Language Agents through Induction, Deduction, and Abduction
- LAW: Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning
- GenRM: Generative Verifiers: Reward Modeling as Next-Token Prediction
- Perspective: Towards Building Specialized Generalist AI with System 1 and System 2 Fusion
- CodeAct: Executable Code Actions Elicit Better LLM Agents
- PLANSEARCH: Planning In Natural Language Improves LLM Search For Code Generation
- LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
- Thinking LLMs: General Instruction Following with Thought Generation
- Agent S: An Open Agentic Framework that Uses Computers Like a Human
- 30 Dec 2024 Aviary: training language agents on challenging scientific tasks - expert iteration & rejection sampling
- 25 Dec 2024 HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
- 18 Dec 2024 A Survey on LLM Inference-Time Self-Improvement
- 02 Dec 2024 Mastering Board Games by External and Internal Planning with Language Models
- 14 Oct 2024 TPO: Thinking LLMs: General Instruction Following with Thought Generation
- Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
- Chain of Thought Imitation with Procedure Cloning
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning
- Self-Discover: Large Language Models Self-Compose Reasoning Structures
- TRICE: Training Chain-of-Thought via Latent-Variable Inference
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- Self-Taught Reasoner: STaR: Bootstrapping Reasoning With Reasoning
- Self-Notes: Learning to Reason and Memorize with Self-Notes
- From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
- LaTRO: Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding, code
- DeepSeek R-1: (https://chat.deepseek.com/)
- OpenR: Technical Report, Project Page, code: openreasoner/openr
- GAIR-NLP/O1-Journey, O1 Replication Journey: Strategic Progress Report - Part 1
- OpenSource-O1/Open-O1
- bklieger-groq/g1: Using Llama-3.1 70b on Groq to create o1-like reasoning chains
- ack-sec/toyberry: Atlas Reasoning System (Toyberry)
- Blog: Reverse engineering OpenAI’s o1 by Nathan Lambert
- 02 Jan 2025 Process Reinforcement through Implicit Rewards - implicit PRM, gh: PRIME-RL/PRIME
- Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
- Solving math word problems with process- and outcome-based feedback
- Training Verifiers to Solve Math Word Problems
- RATIONALYST: Pre-training Process-Supervision for Improving Reasoning
- 14 Dec 2023, Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
- 20 Dec 2024 OREO: Offline Reinforcement Learning for LLM Multi-Step Reasoning
- 11 Oct 2024 DQO: Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization
- 02 Oct 2024 RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
- ReFT: Reasoning with Reinforced Fine-Tuning
- ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
- rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
- LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios, code: opendilab/LightZero
- MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model, open-source impl: werner-duvaud/muzero-general
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
- Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
- STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft
- Inference survey: From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, code, blog
- AlphaCode 2 Technical Report
- GameNGen: Diffusion Models Are Real-Time Game Engines, project page
- A Path Towards Autonomous Machine Intelligence
- GAIA-1: A Generative World Model for Autonomous Driving
- Latent space world-models: Dreamer, V2, V3, DayDreamer
- World Models, web: project page
- Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models
- HYSYNTH: Context-Free LLM Approximation for Guiding Program Synthesis
- SymbolicAI: A framework for logic-based approaches combining generative models and solvers, Library: ExtensityAI/symbolicai
- DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning
- A Neuro-vector-symbolic Architecture for Solving Raven's Progressive Matrices
- Reasoning proofs generated by Prolog: Neuro-Symbolic Integration Brings Causal and Reliable Reasoning Proofs, Code
- VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency, Code
- AlphaGeometry: Solving olympiad geometry without human demonstrations
- Hologram Reasoning for Solving Algebra Problems with Geometry Diagrams
- Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving
- Surveys:
- (Jul 2024) The Prompt Report: A Systematic Survey of Prompting Techniques
- (Feb 2024) A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
- Prompt Engineering Guide Prompting Techniques
- Prompting Fundamentals and How to Apply them Effectively by Eugene Yan
- Tools:
- Chain-of-Thought (CoT): Paper
- Tree-of-Thoughts (ToT): Paper, impl: Strategic Debate
- Graph-of-Thoughts (GoT): Paper, code
- Algorithm of Thoughts (AoT): Paper
- Chain-of-Verification (CoVe/CoV): Paper
- Mixture-of-Agents (MoA): Paper
- Tool-Integrated Reasoning (ToRA / TIR): Paper
- Program of Thoughts (PoT): Paper
- Buffer of Thoughts (BoT): Paper
- Chain of Code (CoC): Paper
- Thought of Search (ToS): Paper
- Re-Reading the question as input (RE2): Paper
- Self-Harmonized Chain of Thought (ECHO): Paper, code
- Divergent CoT (DCoT): Paper
- Iteration of Thought (IoT): Paper
- Logic-of-Thought (LoT): Paper
- Forest-of-Thought (FoT): Paper
- Anthropic: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Geometric Space of Hierarchical Concepts in LLM
- HF: Scaling Test Time Compute with Open Models
- Nebius: Leveraging training and search for better software engineering agents
- DeepMind AlphaProof and AlphaGeometry 2
- Getting 50% (SoTA) on ARC-AGI with GPT-4o, code: rgreenblatt/arc_draw_more_samples_pub
- Schmidhuber: Artificial Curiosity & Creativity
- synthesis.ai: Do Androids Dream? World Models in Modern AI
- Our Transformers Code Agent beats the GAIA benchmark!
- Lil'Log LLM Powered Autonomous Agents (Jun 2023)
- BAIR Blog: The Shift from Models to Compound AI Systems
- Microsoft Research Tracing the path to self-adapting AI agents
- LLMs develop their own understanding of reality as their language abilities improve, Emergent Representations Paper
- LessWrong post: LLM Generality is a Timeline Crux
- Three levels of self-building autonomous agents (Tweet thread by Yohei)
- Don't Sleep on Single-agent Systems
- Video: Improving LLM Reasoning using self-generated data: RL and Verifiers, Slides by Rishabh Agarwal (DeepMind)
- Slides: Reasoning with inference-time compute by Sean Welleck, tweet
- Distill A Gentle Introduction to Graph Neural Networks (2021)
- Geometric Deep Learning - Grids, Groups, Graphs, Geodesics, and Gauges
Answering logical queries over incomplete Knowledge Graphs. Ideally, this requires combining sparse symbolic index collation (SQL, SPARQL, etc.) with dense vector search, preferably in a differentiable manner.
- Neural Graph Reasoning: Complex Logical Query Answering Meets Graph Databases
- Adapting Neural Link Predictors for Data-Efficient Complex Query Answering
- Generalizing Knowledge Graph Embedding with Universal Orthogonal Parameterization
- Knowledge Sheaves: A Sheaf-Theoretic Framework for Knowledge Graph Embedding
- Wasserstein-Fisher-Rao Embedding: Logical Query Embeddings with Local Comparison and Global Transport
- GammaE: Gamma Embeddings for Logical Queries on Knowledge Graphs
- Soft Reasoning on Uncertain Knowledge Graphs
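To make the combination of symbolic lookup and soft scoring a little more concrete, here is a toy sketch of answering a two-hop query over an incomplete graph. Observed edges score 1.0, while a hard-coded table (`PREDICTED`, a hypothetical stand-in for a trained link predictor) supplies soft scores for missing edges. Conjunction is scored with a product and the existential variable with a max. This is one common choice and not necessarily what any of the listed papers use; all names and data are assumptions made up for illustration.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Observed (symbolic) facts; the graph is incomplete on purpose.
FACTS: Set[Triple] = {
    ("einstein", "born_in", "ulm"),
    ("ulm", "located_in", "germany"),
    ("curie", "born_in", "warsaw"),
}

# Hypothetical soft scores standing in for a trained link predictor.
PREDICTED: Dict[Triple, float] = {
    ("warsaw", "located_in", "poland"): 0.9,
    ("warsaw", "located_in", "germany"): 0.1,
}

# Every entity mentioned in either the observed or the predicted edges.
ENTITIES = sorted({e for h, _, t in FACTS | set(PREDICTED) for e in (h, t)})


def score(head: str, rel: str, tail: str) -> float:
    """Return 1.0 for an observed edge, else the predicted score (default 0.0)."""
    if (head, rel, tail) in FACTS:
        return 1.0
    return PREDICTED.get((head, rel, tail), 0.0)


def two_hop(anchor: str, rel1: str, rel2: str) -> Dict[str, float]:
    """Score each x for the query: exists y. rel1(anchor, y) AND rel2(y, x)."""
    answers: Dict[str, float] = {}
    for x in ENTITIES:
        # Product scores the conjunction; max scores the existential variable y.
        best = max(score(anchor, rel1, y) * score(y, rel2, x) for y in ENTITIES)
        if best > 0.0:
            answers[x] = best
    return answers


einstein_answers = two_hop("einstein", "born_in", "located_in")
curie_answers = two_hop("curie", "born_in", "located_in")
print(einstein_answers)
print(curie_answers)
```

The first query is answered purely from observed edges, while the second only gets answers because the soft scores fill in the missing `located_in` edge. With a real embedding model behind `score`, the product and max would let gradients flow back to the model, which is the sense in which such pipelines can be made differentiable.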
Similar to regular CLQA, but with the emphasis on the "Inductive Setting", i.e. querying over nodes, edge types or even entire graphs that were not seen during training. The latter is interesting as it relies on the higher-order "relations between relations" structure, connecting KG inference to Category Theory.
- Zero-shot Logical Query Reasoning on any Knowledge Graph
- Extending Transductive Knowledge Graph Embedding Models for Inductive Logical Relational Inference
- Neural-Symbolic Models for Logical Queries on Knowledge Graphs
- InGram: Inductive Knowledge Graph Embedding via Relation Graphs
First attempted in 2014 with the general-purpose but unstable Neural Turing Machines, Neural Algorithmic Reasoning (NAR) now limits its scope to GNN-based "Algorithmic Processor Networks". These learn to mimic classical algorithms on synthetic data and can be deployed on noisy real-world problems by sandwiching frozen instances inside an Encoder-Processor-Decoder architecture.
- Neural Turing Machines, 2014
- A Generalist Neural Algorithmic Learner
- Transformers meet Neural Algorithmic Reasoners
- Recursive Algorithmic Reasoning
- Dual Algorithmic Reasoning
- Learning to Configure Computer Networks with Neural Algorithmic Reasoning
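The Encoder-Processor-Decoder layout mentioned above can be sketched without any learning. In the toy below, the processor is hand-written to perform one Bellman-Ford relaxation round per step (a message-passing step with min aggregation); in actual NAR work this step would be a trained GNN layer that only approximates such behaviour. The function names and the example graph are assumptions for illustration, not taken from any of the listed papers.

```python
import math
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str, float]  # (source, target, weight)


def encode(nodes: List[str], source: str) -> Dict[str, float]:
    """Encoder: map raw input to a per-node latent (here, a distance estimate)."""
    return {n: (0.0 if n == source else math.inf) for n in nodes}


def process(latent: Dict[str, float], edges: List[Edge]) -> Dict[str, float]:
    """Processor: one message-passing step with min aggregation.

    Hand-written to equal one Bellman-Ford relaxation round; a learned
    processor would replace this function.
    """
    updated = dict(latent)
    for u, v, w in edges:
        updated[v] = min(updated[v], latent[u] + w)
    return updated


def decode(latent: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Decoder: map latents to outputs; unreachable nodes become None."""
    return {n: (None if math.isinf(d) else d) for n, d in latent.items()}


def run(nodes: List[str], edges: List[Edge], source: str) -> Dict[str, Optional[float]]:
    latent = encode(nodes, source)
    # |V| - 1 processor steps are enough for shortest paths to converge.
    for _ in range(len(nodes) - 1):
        latent = process(latent, edges)
    return decode(latent)


nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b", 4.0), ("a", "c", 1.0), ("c", "b", 2.0), ("b", "d", 5.0)]
distances = run(nodes, edges, "a")
print(distances)
```

The point of the split is that `encode` and `decode` can be swapped for components that handle noisy real-world inputs, while the processor in the middle stays fixed.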
- Deep Networks Always Grok and Here is Why
- Grokfast: Accelerated Grokking by Amplifying Slow Gradients, review post by Lucas Nestler
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
- QwenLM/Qwen-Agent
- meta-llama/llama-agentic-system
- gpt-researcher, docs
- open-interpreter, docs
- ADAS (Automated Design of Agentic Systems)
- AI-Scientist
- Ollama_Agents
- AgentK
- Storm, Paper
- crewAI, docs
- AutoGPT, docs
- AutoGen, docs, AutoGen Studio Paper
- Trace, docs, Paper
- motleycrew, docs
- langflow, docs
- show-me: A Visual and Transparent Reasoning Agent
Weak methods are general but don't use knowledge (heuristics) to guide the search process.
- depth-first-search (DFS)
- breadth-first-search (BFS)
- depth-limited-search, iterative-deepening-depth-first-search (IDDFS)
- generate-and-test
- hill-climbing (borderline case between weak and strong methods)
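As a rough illustration of why these methods count as "weak", the sketch below implements iterative-deepening depth-first search using only a successor function and a goal test, with no heuristic guidance. The toy puzzle (reach a target number from 1 using `+1` and `*2`) and all function names are assumptions chosen for the example.

```python
from typing import Callable, List, Optional


def depth_limited_search(
    state: int,
    is_goal: Callable[[int], bool],
    successors: Callable[[int], List[int]],
    limit: int,
    path: Optional[List[int]] = None,
) -> Optional[List[int]]:
    """Depth-first search that gives up after `limit` steps; returns a path or None."""
    path = [state] if path is None else path
    if is_goal(state):
        return path
    if limit == 0:
        return None
    for nxt in successors(state):
        if nxt in path:  # avoid cycles along the current path
            continue
        found = depth_limited_search(nxt, is_goal, successors, limit - 1, path + [nxt])
        if found is not None:
            return found
    return None


def iddfs(
    start: int,
    is_goal: Callable[[int], bool],
    successors: Callable[[int], List[int]],
    max_depth: int = 20,
) -> Optional[List[int]]:
    """Run depth-limited search with limits 0, 1, 2, ... until a path is found."""
    for limit in range(max_depth + 1):
        found = depth_limited_search(start, is_goal, successors, limit)
        if found is not None:
            return found
    return None


def successors(n: int) -> List[int]:
    """Toy domain: from n you may move to n + 1 or n * 2."""
    return [n + 1, n * 2]


path = iddfs(1, lambda n: n == 10, successors)
print(path)
```

Nothing in `iddfs` knows anything about numbers; the same code works for any domain that supplies `successors` and `is_goal`. That generality is the strength of weak methods, and the lack of guidance is why they scale poorly.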
- The Soar Cognitive Architecture, John E. Laird, MIT Press, 2019
- How to Build a Brain: A Neural Architecture for Biological Cognition Chris Eliasmith, Oxford Series on Cognitive Models and Architectures, 2013
- Active Inference: The Free Energy Principle in Mind, Brain, and Behavior, Thomas Parr, Giovanni Pezzulo, Karl J. Friston, MIT Press, 2022, MLST Interview with Thomas Parr
- Principles of Synthetic Intelligence PSI: An Architecture of Motivated Cognition, Joscha Bach, Oxford Series on Cognitive Models and Architectures Book 4, 2009
- Conscious Mind, Resonant Brain: How Each Brain Makes a Mind, Stephen Grossberg, Oxford University Press, 2021
- The Society of Mind, Marvin Minsky, Simon & Schuster, 1986
- Reinforcement Learning: An Introduction 2nd Edition, Sutton & Barto, MIT Press, 2018
- Reinforcement Learning: An Overview, Dec 2024, Kevin Murphy
- Mathematical Foundations of Reinforcement Learning, Shiyu Zhao, open course on github + video lectures
- Natural Language Cognitive Architecture, David Shapiro, 2022, open source copy
- An Introduction to Universal Artificial Intelligence, Marcus Hutter, David Quarel, Elliot Catt, CRC Press, 2024 - AIXI, Slides, Video
Diverse approaches, some of which tap into classical PDE systems of biological NNs, some of which concentrate on Distributed Sparse Representations (by default non-differentiable), while others draw inspiration from Hippocampal Grid Cells, Place Cells, etc. Biological systems surpass most ML methods for Continual and Online Learning, but are hard to implement efficiently on GPU.
- Ogma Sparse Predictive Hierarchies (SPH): whitepaper
- The Tolman-Eichenbaum Machine: Unifying space and relational memory through generalisation in the hippocampal formation (TEM), TEM-t
- Arousal as a universal embedding for spatiotemporal brain dynamics
- Sparse Distributed Memory is a Continual Learner
- Computation with Sequences of Assemblies in a Model of the Brain
Dense Associative Memory is mainly represented by Modern Hopfield Networks (MHN), which can be viewed as generalized Transformers capable of storing queries, keys and values explicitly (as in Vector Databases) and running recurrent retrieval by energy minimization (relating them to Diffusion models). Application to Continual Learning is possible when combined with uncertainty quantification and differentiable top-k selection.
- xLSTM repository
- CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory
- Energy Transformer
- Memorization and consolidation in associative memory networks
- Simplicial Hopfield networks
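The recurrent retrieval described above can be illustrated with the softmax-based update commonly associated with Modern Hopfield Networks: the state is repeatedly replaced by a softmax-weighted average of the stored patterns. The sketch below is a minimal pure-Python version; the function names, the value of `beta`, and the toy patterns are assumptions for illustration only.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def hopfield_retrieve(
    patterns: List[Vector], query: Vector, beta: float = 4.0, steps: int = 3
) -> Vector:
    """Iterate state <- sum_i softmax(beta * <pattern_i, state>)_i * pattern_i."""
    state = list(query)
    for _ in range(steps):
        # Similarity of the current state to every stored pattern.
        weights = softmax([beta * dot(p, state) for p in patterns])
        # New state is the weighted average of the stored patterns.
        state = [
            sum(w * p[i] for w, p in zip(weights, patterns))
            for i in range(len(state))
        ]
    return state


patterns = [
    [1.0, 1.0, -1.0, -1.0],
    [-1.0, 1.0, 1.0, -1.0],
    [1.0, -1.0, 1.0, 1.0],
]
noisy = [1.0, 1.0, -1.0, 1.0]  # first pattern with its last entry flipped
retrieved = hopfield_retrieve(patterns, noisy)
print([round(v, 3) for v in retrieved])
```

A single step of this loop is the same computation as one attention read in which the stored patterns serve as both keys and values, which is the link to Transformers. Larger `beta` makes retrieval snap to a single stored pattern, while smaller `beta` blends several.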
- paul-gauthier/aider
- claude-engineer
- continuedev/continue
- OpenHands (formerly OpenDevin)
- princeton-nlp/SWE-agent, documentation
- stanfordnlp/dspy, DSPy awesome list: ganarajpr/awesome-dspy, paper
- InternLM/lagent - lightweight framework for building LLM-based agents
- Software Engineering
- Devin
- Cursor
- Windsurf by Codeium
- GitHub Copilot & copilot-workspace
- textgrad
- Cosine Genie
- v0.dev by Vercel
- Replit AI
- bolt
- continue.dev
- Amazon Q Developer
- Cody by Sourcegraph
- AWS Automated Reasoning checks
- DevAI: Agent-as-a-Judge: Evaluate Agents with Agents
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents, web: project page, gh: stonybrooknlp/appworld, Leaderboard
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents, gh: camel-ai/crab
- WebArena: A Realistic Web Environment for Building Autonomous Agents, web: project page, Leaderboard
- ARC-AGI: Leaderboard, On the Measure of Intelligence
- PlanBench: Paper, gh: karthikv792/LLMs-Planning
- GAIA: a benchmark for General AI Assistants: Leaderboard
- StreamBench: Towards Benchmarking Continuous Improvement of Language Agents, gh: stream-bench/stream-bench
- VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
- ZebraLogic, Leaderboard
- Omni-MATH, gh: KbsdJames/Omni-MATH
- BatsResearch/planetarium - Dataset and benchmark for assessing LLMs in translating natural language descriptions of planning problems into PDDL
- SWE-bench, SWE-bench Lite
- BigCodeBench: The Next Generation of HumanEval, Leaderboard
- SciCode: A Research Coding Benchmark Curated by Scientists, web: https://scicode-bench.github.io/
- commit-0: the challenge is to rebuild Python core libraries and pass their unit tests, Leaderboard
- Awesome LLM Strawberry (OpenAI o1)
- awesome-o1 literature list by Sasha Rush
- awesome-ai-agents
- Nous Research Open Reasoning Tasks, a list of reasoning tasks, gh: NousResearch/Open-Reasoning-Tasks
- ARC-AGI Resources Google table paper list by ARC Prize
- Sasha Rush: Speculations on Test-Time Scaling (o1)
- François Chollet: It's Not About Scale, It's About Abstraction
- Evaluating, Understanding and Improving Approaches for Machine Reasoning
- Channel: David Shapiro
- Artem Kirsanov: Engrams, Building Blocks of Memory in the Brain
- Channel: Edan Meyer on AI, ML & RL, Discrete vs. Continuous RL + Paper
- MIT AGI: Cognitive Architecture (Nate Derbinsky)
- Channel: Thinking About Thinking (Mathematics of Neuroscience and AI)
- Invariance and equivariance in brains and machines
- code_your_own_AI: The CORE IDEA of AI Agents Explained
- SmallThinker-3B-Preview (small model trained on PowerInfer/QWQ-LONGCOT-500K)
- QwQ-32B-Preview, Blog post
- ruliad/deepthought-8b-llama-v0.01-alpha JSON format: 1. Problem understanding, 2. Data gathering, 3. Analysis, 4. Calculation (when applicable), 5. Verification, 6. Conclusion drawing, 7. Implementation
- migtissera/Tess-R1-Limerick-Llama-3.1-70B XML tags:
  - `<thinking>` tag to indicate when the model is performing CoT.
  - `<contemplation>` tag when the model contemplates its answers.
  - `<alternatively>` tag for alternate suggestions.
  - `<output>` tag for the final output.
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
- Memory3: Language Modeling with Explicit Memory
- TTT: Learning to (Learn at Test Time): RNNs with Expressive Hidden States, Video
- TransformerFAM: Feedback attention is working memory
- Machine Consciousness
- Consciousness as a coherence-inducing operator, talk by Joscha Bach at the Models of Consciousness Conferences
- The brain simulates actions and their consequences during REM sleep
- CSCG: Clone-structured graph representations enable flexible learning and vicarious evaluation of cognitive maps
- System-1 and System-2 realized within the Common Model of Cognition (2022)
https://s2r-at-scale-workshop.github.io (NeurIPS 2024)
To share a link related to reasoning in AI systems that is missing here, please create a pull request for this file. See editing files in the GitHub documentation.