This repository contains the resources for our survey *Test-time Computing: from System-1 Thinking to System-2 Thinking*.
- Test-Time Training with Self-Supervision for Generalization under Distribution Shifts [ICML 2020] paper
- MT3: Meta Test-Time Training for Self-Supervised Test-Time Adaption [AISTATS 2022] paper code
- Test-Time Training with Masked Autoencoders [NeurIPS 2022] paper
- TTT++: When Does Self-Supervised Test-Time Training Fail or Thrive? [NeurIPS 2021] paper code
- Efficient Test-Time Prompt Tuning for Vision-Language Models [arxiv 2024.8] paper
- Tent: Fully Test-time Adaptation by Entropy Minimization [ICLR 2021] paper code
- MEMO: Test Time Robustness via Adaptation and Augmentation [NeurIPS 2022] paper code
- The Entropy Enigma: Success and Failure of Entropy Minimization [arxiv 2024.5] paper code
- On Pitfalls of Test-Time Adaptation [ICML 2023] paper code
- Beware of Model Collapse! Fast and Stable Test-time Adaptation for Robust Question Answering [EMNLP 2023] paper code
- Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization [NeurIPS 2023] paper
- Protected Test-Time Adaptation via Online Entropy Matching: A Betting Approach [arxiv 2024.8] paper code
- Simulating Bandit Learning from User Feedback for Extractive Question Answering [ACL 2022] paper code
- Using Interactive Feedback to Improve the Accuracy and Explainability of Question Answering Systems Post-Deployment [ACL 2022] paper
- Test-time Adaptation for Machine Translation Evaluation by Uncertainty Minimization [ACL 2023] paper code
- COMET: A Neural Framework for MT Evaluation [EMNLP 2020] paper
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [ICLR 2024] paper code
- Improving robustness against common corruptions by covariate shift adaptation [NeurIPS 2020] paper
- Selective Annotation Makes Language Models Better Few-Shot Learners [arxiv 2022.9] paper code
- Test-Time Adaptation with Perturbation Consistency Learning [arxiv 2023.4] paper
- Test-Time Prompt Adaptation for Vision-Language Models [NeurIPS 2023] paper
- Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning [ICCV 2023] paper code
- Test-Time Model Adaptation with Only Forward Passes [ICML 2024] paper code
- Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models [arxiv 2024.7] paper code
- StreamAdapter: Efficient Test Time Adaptation from Contextual Streams [arxiv 2024.11] paper
- Towards Stable Test-time Adaptation in Dynamic Wild World [ICLR 2023] paper code
- SoTTA: Robust Test-Time Adaptation on Noisy Data Streams [NeurIPS 2023] paper code
- Robust Question Answering against Distribution Shifts with Test-Time Adaption: An Empirical Study [EMNLP 2022] paper code
- What Makes Good In-Context Examples for GPT-3? [DeeLIO 2022] paper
- In-Context Learning with Iterative Demonstration Selection [EMNLP 2024] paper
- Dr.ICL: Demonstration-Retrieved In-context Learning [arxiv 2023.5] paper
- Learning To Retrieve Prompts for In-Context Learning [NAACL 2022] paper
- Unified Demonstration Retriever for In-Context Learning [ACL 2023] paper code
- Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers [ACL 2023] paper code
- Finding Support Examples for In-Context Learning [EMNLP 2023] paper code
- Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [NeurIPS 2023] paper code
- Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity [ACL 2022] paper
- Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering [ACL 2023] paper
- RetICL: Sequential Retrieval of In-Context Examples with Reinforcement Learning [arxiv 2024.4] paper
- Automatic Chain of Thought Prompting in Large Language Models [ICLR 2023] paper code
- Self-ICL: Zero-Shot In-Context Learning with Self-Generated Demonstrations [EMNLP 2023] paper code
- Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations [ACL 2023] paper
- Self-Generated In-Context Learning: Leveraging Auto-regressive Language Models as a Demonstration Generator [arxiv 2022.6] paper
- Demonstration Augmentation for Zero-shot In-context Learning [ACL 2024] paper code
- Plug and Play Language Models: A Simple Approach to Controlled Text Generation [ICLR 2020] paper
- Steering Language Models With Activation Engineering [arxiv 2024.10] paper
- Improving Instruction-Following in Language Models through Activation Steering [arxiv 2024.10] paper
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model [arxiv 2024.6] paper code
- Refusal in Language Models Is Mediated by a Single Direction [arxiv 2024.10] paper code
- In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering [arxiv 2024.2] paper code
- Investigating Bias Representations in Llama 2 Chat via Activation Steering [arxiv 2024.2] paper
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [arxiv 2024.7] paper code
- Spectral Editing of Activations for Large Language Model Alignment [NeurIPS 2024] paper code
- Multi-property Steering of Large Language Models with Dynamic Activation Composition [BlackboxNLP 2024] paper code
- Generalization through Memorization: Nearest Neighbor Language Models [ICLR 2020] paper code
- Nearest Neighbor Machine Translation [ICLR 2021] paper code
- Efficient Cluster-Based k-Nearest-Neighbor Machine Translation [ACL 2022] paper code
- What Knowledge Is Needed? Towards Explainable Memory for kNN-MT Domain Adaptation [ACL 2023] paper code
- Efficient Domain Adaptation for Non-Autoregressive Machine Translation [ACL 2024] paper code
- kNN-NER: Named Entity Recognition with Nearest Neighbor Search [arxiv 2022.3] paper code
- kNN-CM: A Non-parametric Inference-Phase Adaptation of Parametric Text Classifiers [EMNLP 2023] paper code
- AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [ICML 2023] paper code
- Training Verifiers to Solve Math Word Problems [arxiv 2021.10] paper
- Advancing LLM Reasoning Generalists with Preference Trees [arxiv 2024.4] paper code
- V-STaR: Training Verifiers for Self-Taught Reasoners [COLM 2024] paper
- Solving math word problems with process- and outcome-based feedback [arxiv 2022.11] paper
- Let's Verify Step by Step [ICLR 2024] paper code
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [ACL 2024] paper
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision [arxiv 2024.6] paper
- Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [arxiv 2024.10] paper
- Critique-out-Loud Reward Models [arxiv 2024.8] paper code
- Improving Reward Models with Synthetic Critiques [arxiv 2024.5] paper
- Generative Verifiers: Reward Modeling as Next-Token Prediction [arxiv 2024.8] paper
- Self-Generated Critiques Boost Reward Modeling for Language Models [arxiv 2024.11] paper
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [ACL 2023] paper code
- ChatGPT as a Factual Inconsistency Evaluator for Text Summarization [arxiv 2023.3] paper
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [EMNLP 2023] pdf code
- Can Large Language Models Be an Alternative to Human Evaluations? [ACL 2023] paper
- LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [arxiv 2024.6] paper
- Large Language Models are not Fair Evaluators [ACL 2024] paper code
- Large Language Models are Inconsistent and Biased Evaluators [arxiv 2024.5] paper
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [NeurIPS 2023] paper code
- PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization [ICLR 2024] paper
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges [arxiv 2023.10] paper code
- Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging [arxiv 2024.5] paper code
- REFINER: Reasoning Feedback on Intermediate Representations [ACL 2024] paper
- Shepherd: A Critic for Language Model Generation [arxiv 2023.8] paper code
- Generative Judge for Evaluating Alignment [ICLR 2024] paper code
- Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate [arxiv 2024.7] paper
- Competition-level code generation with AlphaCode [Science 2022] paper code
- Code Llama: Open Foundation Models for Code [arxiv 2023.8] paper code
- More Agents Is All You Need [arxiv 2024.2] paper code
- Just Ask One More Time! Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios [ACL 2024] paper
- Self-Consistency Improves Chain of Thought Reasoning in Language Models [ICLR 2023] paper
- Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning [arxiv 2024.10] paper code
- Learning to summarize from human feedback [NeurIPS 2020] paper
- Training Verifiers to Solve Math Word Problems [arxiv 2021.10] paper
- WebGPT: Browser-assisted question-answering with human feedback [arxiv 2021.12] paper
- Making Language Models Better Reasoners with Step-Aware Verifier [ACL 2023] paper code
- Accelerating Best-of-N via Speculative Rejection [ICML 2024] paper
- TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [arxiv 2024.10] paper
- Fast Best-of-N Decoding via Speculative Rejection [NeurIPS 2024] paper
- Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation [arxiv 2024.10] paper code
- Preference-Guided Reflective Sampling for Aligning Language Models [EMNLP 2024] paper code
- Reinforced Self-Training (ReST) for Language Modeling [arxiv 2023.8] paper
- Variational Best-of-N Alignment [arxiv 2024.7] paper
- BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling [NeurIPS 2024] paper
- BOND: Aligning LLMs with Best-of-N Distillation [arxiv 2024.7] paper
- Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models [arxiv 2024.12] paper
- Reflexion: Language Agents with Verbal Reinforcement Learning [arxiv 2023.3] paper code
- Interscript: A dataset for interactive learning of scripts through error feedback [arxiv 2021.12] paper code
- NL-EDIT: Correcting Semantic Parse Errors through Natural Language Interaction [ACL 2021] paper code
- Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback [ACL 2022] paper code
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [ICLR 2024] paper code
- Teaching Large Language Models to Self-Debug [ICLR 2024] paper
- RARR: Researching and Revising What Language Models Say, Using Language Models [ACL 2023] paper code
- Graph-based, Self-Supervised Program Repair from Diagnostic Feedback [ICML 2020] paper
- Improving Factuality and Reasoning in Language Models through Multiagent Debate [arxiv 2023.5] paper code
- Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate [EMNLP 2023] paper code
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [EMNLP 2024] paper code
- ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs [ACL 2024] paper code
- Mixture-of-Agents Enhances Large Language Model Capabilities [arxiv 2024.6] paper code
- Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate [arxiv 2024.7] paper
- Debating with More Persuasive LLMs Leads to More Truthful Answers [ICML 2024] paper code
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [ICLR 2024] pdf
- ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [LREC-COLING 2024] paper
- Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment [arxiv 2023.11] paper
- MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate [arxiv 2024.6] paper
- Teaching Models to Balance Resisting and Accepting Persuasion [arxiv 2024.10] paper code
- GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion [arxiv 2024.9] paper
- Improving Multi-Agent Debate with Sparse Communication Topology [arxiv 2024.6] paper
- Self-Rewarding Language Models [arxiv 2024.1] paper
- Constitutional AI: Harmlessness from AI Feedback [arxiv 2022.12] paper code
- Self-Refine: Iterative Refinement with Self-Feedback [NeurIPS 2023] paper
- Language Models can Solve Computer Tasks [arxiv 2023.3] paper code
- Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models [arxiv 2024.2] paper code
- Is Self-Repair a Silver Bullet for Code Generation? [ICLR 2024] paper code
- Large Language Models Cannot Self-Correct Reasoning Yet [ICLR 2024] paper
- Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies [arxiv 2024.6] paper
- Can Large Language Models Really Improve by Self-critiquing Their Own Plans? [arxiv 2023.10] paper
- GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems [arxiv 2023.10] paper
- When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs [arxiv 2024.6] paper
- LLMs cannot find reasoning errors, but can correct them given the error location [ACL 2024] paper code
- Self-critiquing models for assisting human evaluators [arxiv 2022.6] paper
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve [arxiv 2024.7] paper
- Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning [arxiv 2024.10] paper
- Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning [arxiv 2024.6] paper code
- GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements [arxiv 2024.2] paper
- Generating Sequences by Learning to Self-Correct [ICLR 2023] paper code
- Training Language Models to Self-Correct via Reinforcement Learning [arxiv 2024.9] paper
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models [NeurIPS 2023] paper code
- Self-Evaluation Guided Beam Search for Reasoning [NeurIPS 2023] paper code
- Reasoning with Language Model is Planning with World Model [EMNLP 2023] paper code
- Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [arxiv 2024.6] paper code
- Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [arxiv 2023.9] paper
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [arxiv 2024.8] paper code
- Interpretable Contrastive Monte Carlo Tree Search Reasoning [arxiv 2024.10] paper code
- ReST-MCTS: LLM Self-Training via Process Reward Guided Tree Search [arxiv 2024.6] pdf code
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [arxiv 2024.5] paper code
- O1 Replication Journey: A Strategic Progress Report -- Part 1 [arxiv 2024.10] paper code
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [arxiv 2024.11] paper code
- o1-Coder: an o1 Replication for Coding [arxiv 2024.12] paper code
- DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging [EMNLP 2024] paper code
- Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs [NeurIPS 2024] paper code
- Generalizing Reward Modeling for Out-of-Distribution Preference Learning [ECML-PKDD 2024] paper code
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [arxiv 2023.12] paper
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [ICLR 2024] paper code
- Multimodal Chain-of-Thought Reasoning in Language Models [TMLR 2024] paper code
- Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [NeurIPS 2024] paper
- KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning [AAAI 2024] paper
- Multimodal Reasoning with Multimodal Knowledge Graph [ACL 2024] paper
- Interleaved-Modal Chain-of-Thought [arxiv 2024.11] paper
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [arxiv 2024.11] paper code
- Learning How Hard to Think: Input-Adaptive Allocation of LM Computation [arxiv 2024.10] paper
- Scaling LLM Inference with Optimized Sample Compute Allocation [arxiv 2024.10] paper code
- Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies [EMNLP 2024] paper
- Token-Budget-Aware LLM Reasoning [arxiv 2024.12] paper code
- Compressed Chain of Thought: Efficient Reasoning Through Dense Representations [arxiv 2024.12] paper
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [arxiv 2024.9] paper
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [arxiv 2024.8] paper
- Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [arxiv 2024.10] paper code
- A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models [arxiv 2024.11] paper
- The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [arxiv 2024.11] paper code
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [arxiv 2024.11] paper code
- Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS [arxiv 2024.11] paper
If our survey is helpful to your research, please cite our paper:
```bibtex
@article{ji2025test,
  title={Test-time Computing: from System-1 Thinking to System-2 Thinking},
  author={Ji, Yixin and Li, Juntao and Ye, Hai and Wu, Kaixin and Xu, Jia and Mo, Linjian and Zhang, Min},
  journal={arXiv preprint arXiv:2501.02497},
  year={2025}
}
```