This repository contains the resources for our survey *Test-time Computing: from System-1 Thinking to System-2 Thinking*.
- Test-Time Training with Self-Supervision for Generalization under Distribution Shifts [ICML 2020] paper
- MT3: Meta Test-Time Training for Self-Supervised Test-Time Adaption [AISTATS 2022] paper code
- Test-Time Training with Masked Autoencoders [NeurIPS 2022] paper
- TTT++: When Does Self-Supervised Test-Time Training Fail or Thrive? [NeurIPS 2021] paper code
- Efficient Test-Time Prompt Tuning for Vision-Language Models [arxiv 2024.8] paper
- Tent: Fully Test-time Adaptation by Entropy Minimization [ICLR 2021] paper code
- MEMO: Test Time Robustness via Adaptation and Augmentation [NeurIPS 2022] paper code
- The Entropy Enigma: Success and Failure of Entropy Minimization [arxiv 2024.5] paper code
- On Pitfalls of Test-Time Adaptation [ICML 2023] paper code
- Beware of Model Collapse! Fast and Stable Test-time Adaptation for Robust Question Answering [EMNLP 2023] paper code
- Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization [NeurIPS 2023] paper
- Protected Test-Time Adaptation via Online Entropy Matching: A Betting Approach [arxiv 2024.8] paper code
- Simulating Bandit Learning from User Feedback for Extractive Question Answering [ACL 2022] paper code
- Using Interactive Feedback to Improve the Accuracy and Explainability of Question Answering Systems Post-Deployment [ACL 2022] paper
- Test-time Adaptation for Machine Translation Evaluation by Uncertainty Minimization [ACL 2023] paper code
- COMET: A Neural Framework for MT Evaluation [EMNLP 2020] paper
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [ICLR 2024] paper code
- Improving robustness against common corruptions by covariate shift adaptation [NeurIPS 2020] paper
- Selective Annotation Makes Language Models Better Few-Shot Learners [arxiv 2022.9] paper code
- Test-Time Adaptation with Perturbation Consistency Learning [arxiv 2023.4] paper
- Test-Time Prompt Adaptation for Vision-Language Models [NeurIPS 2023] paper
- Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning [ICCV 2023] paper code
- Test-Time Model Adaptation with Only Forward Passes [ICML 2024] paper code
- Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models [arxiv 2024.7] paper code
- StreamAdapter: Efficient Test Time Adaptation from Contextual Streams [arxiv 2024.11] paper
- Towards Stable Test-time Adaptation in Dynamic Wild World [ICLR 2023] paper code
- SoTTA: Robust Test-Time Adaptation on Noisy Data Streams [NeurIPS 2023] paper code
- Robust Question Answering against Distribution Shifts with Test-Time Adaption: An Empirical Study [EMNLP 2022] paper code
- What Makes Good In-Context Examples for GPT-3? [DeeLIO 2022] paper
- In-Context Learning with Iterative Demonstration Selection [EMNLP 2024] paper
- Dr.ICL: Demonstration-Retrieved In-context Learning [arxiv 2023.5] paper
- Learning To Retrieve Prompts for In-Context Learning [NAACL 2022] paper
- Unified Demonstration Retriever for In-Context Learning [ACL 2023] paper code
- Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers [ACL 2023] paper code
- Finding Support Examples for In-Context Learning [EMNLP 2023] paper code
- Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [NeurIPS 2023] paper code
- Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity [ACL 2022] paper
- Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering [ACL 2023] paper
- RetICL: Sequential Retrieval of In-Context Examples with Reinforcement Learning [arxiv 2024.4] paper
- Automatic Chain of Thought Prompting in Large Language Models [ICLR 2023] paper code
- Self-ICL: Zero-Shot In-Context Learning with Self-Generated Demonstrations [EMNLP 2023] paper code
- Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations [ACL 2023] paper
- Self-Generated In-Context Learning: Leveraging Auto-regressive Language Models as a Demonstration Generator [arxiv 2022.6] paper
- Demonstration Augmentation for Zero-shot In-context Learning [ACL 2024] paper code
- Plug and Play Language Models: A Simple Approach to Controlled Text Generation [ICLR 2020] paper
- Steering Language Models With Activation Engineering [arxiv 2024.10] paper
- Improving Instruction-Following in Language Models through Activation Steering [arxiv 2024.10] paper
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model [arxiv 2024.6] paper code
- Refusal in Language Models Is Mediated by a Single Direction [arxiv 2024.10] paper code
- In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering [arxiv 2024.2] paper code
- Investigating Bias Representations in Llama 2 Chat via Activation Steering [arxiv 2024.2] paper
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [arxiv 2024.7] paper code
- Spectral Editing of Activations for Large Language Model Alignment [NeurIPS 2024] paper code
- Multi-property Steering of Large Language Models with Dynamic Activation Composition [BlackboxNLP 2024] paper code
- Generalization through Memorization: Nearest Neighbor Language Models [ICLR 2020] paper code
- Nearest Neighbor Machine Translation [ICLR 2021] paper code
- Efficient Cluster-Based k-Nearest-Neighbor Machine Translation [ACL 2022] paper code
- What Knowledge Is Needed? Towards Explainable Memory for kNN-MT Domain Adaptation [ACL 2023] paper code
- Efficient Domain Adaptation for Non-Autoregressive Machine Translation [ACL 2024] paper code
- kNN-NER: Named Entity Recognition with Nearest Neighbor Search [arxiv 2022.3] paper code
- kNN-CM: A Non-parametric Inference-Phase Adaptation of Parametric Text Classifiers [EMNLP 2023] paper code
- AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [ICML 2023] paper code
- Training Verifiers to Solve Math Word Problems [arxiv 2021.10] paper
- Advancing LLM Reasoning Generalists with Preference Trees [arxiv 2024.4] paper code
- V-STaR: Training Verifiers for Self-Taught Reasoners [COLM 2024] paper
- Solving math word problems with process- and outcome-based feedback [arxiv 2022.11] paper
- Let's Verify Step by Step [ICLR 2024] paper code
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [ACL 2024] paper
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision [arxiv 2024.6] paper
- Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [arxiv 2024.10] paper
- Critique-out-Loud Reward Models [arxiv 2024.8] paper code
- Improving Reward Models with Synthetic Critiques [arxiv 2024.5] paper
- Generative Verifiers: Reward Modeling as Next-Token Prediction [arxiv 2024.8] paper
- Self-Generated Critiques Boost Reward Modeling for Language Models [arxiv 2024.11] paper
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [ACL 2023] paper code
- ChatGPT as a Factual Inconsistency Evaluator for Text Summarization [arxiv 2023.3] paper
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [EMNLP 2023] pdf code
- Can Large Language Models Be an Alternative to Human Evaluations? [ACL 2023] paper
- LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [arxiv 2024.6] paper
- Large Language Models are not Fair Evaluators [ACL 2024] paper code
- Large Language Models are Inconsistent and Biased Evaluators [arxiv 2024.5] paper
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [NeurIPS 2023] paper code
- PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization [ICLR 2024] paper
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges [arxiv 2023.10] paper code
- Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging [arxiv 2024.5] paper code
- REFINER: Reasoning Feedback on Intermediate Representations [ACL 2024] paper
- Shepherd: A Critic for Language Model Generation [arxiv 2023.8] paper code
- Generative Judge for Evaluating Alignment [ICLR 2024] paper code
- Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate [arxiv 2024.7] paper
- Competition-level code generation with AlphaCode [Science 2022] paper code
- Code Llama: Open Foundation Models for Code [arxiv 2023.8] paper code
- More Agents Is All You Need [arxiv 2024.2] paper code
- Just Ask One More Time! Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios [ACL 2024] paper
- Self-Consistency Improves Chain of Thought Reasoning in Language Models [ICLR 2023] paper
- Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning [arxiv 2024.10] paper code
- Learning to summarize from human feedback [NeurIPS 2020] paper
- Training Verifiers to Solve Math Word Problems [arxiv 2021.10] paper
- WebGPT: Browser-assisted question-answering with human feedback [arxiv 2021.12] paper
- Making Language Models Better Reasoners with Step-Aware Verifier [ACL 2023] paper code
- Accelerating Best-of-N via Speculative Rejection [ICML 2024] paper
- TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [arxiv 2024.10] paper
- Fast Best-of-N Decoding via Speculative Rejection [NeurIPS 2024] paper
- Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation [arxiv 2024.10] paper code
- Preference-Guided Reflective Sampling for Aligning Language Models [EMNLP 2024] paper code
- Reinforced Self-Training (ReST) for Language Modeling [arxiv 2023.8] paper
- Variational Best-of-N Alignment [arxiv 2024.7] paper
- BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling [NeurIPS 2024] paper
- BOND: Aligning LLMs with Best-of-N Distillation [arxiv 2024.7] paper
- Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models [arxiv 2024.12] paper
- Reflexion: Language Agents with Verbal Reinforcement Learning [arxiv 2023.3] paper code
- Interscript: A dataset for interactive learning of scripts through error feedback [arxiv 2021.12] paper code
- NL-EDIT: Correcting Semantic Parse Errors through Natural Language Interaction [ACL 2021] paper code
- Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback [ACL 2022] paper code
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [ICLR 2024] paper code
- Teaching Large Language Models to Self-Debug [ICLR 2024] paper
- RARR: Researching and Revising What Language Models Say, Using Language Models [ACL 2023] paper code
- Graph-based, Self-Supervised Program Repair from Diagnostic Feedback [ICML 2020] paper
- Improving Factuality and Reasoning in Language Models through Multiagent Debate [arxiv 2023.5] paper code
- Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate [EMNLP 2023] paper code
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [EMNLP 2024] paper code
- ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs [ACL 2024] paper code
- Mixture-of-Agents Enhances Large Language Model Capabilities [arxiv 2024.6] paper code
- Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate [arxiv 2024.7] paper
- Debating with More Persuasive LLMs Leads to More Truthful Answers [ICML 2024] paper code
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [ICLR 2024] pdf
- ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [LREC-COLING 2024] paper
- Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment [arxiv 2023.11] paper
- MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate [arxiv 2024.6] paper
- Teaching Models to Balance Resisting and Accepting Persuasion [arxiv 2024.10] paper code
- GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion [arxiv 2024.9] paper
- Improving Multi-Agent Debate with Sparse Communication Topology [arxiv 2024.6] paper
- Self-Rewarding Language Models [arxiv 2024.1] paper
- Constitutional AI: Harmlessness from AI Feedback [arxiv 2022.12] paper code
- Self-Refine: Iterative Refinement with Self-Feedback [NeurIPS 2023] paper
- Language Models can Solve Computer Tasks [arxiv 2023.3] paper code
- Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models [arxiv 2024.2] paper code
- Is Self-Repair a Silver Bullet for Code Generation? [ICLR 2024] paper code
- Large Language Models Cannot Self-Correct Reasoning Yet [ICLR 2024] paper
- Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies [arxiv 2024.6] paper
- Can Large Language Models Really Improve by Self-critiquing Their Own Plans? [arxiv 2023.10] paper
- GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems [arxiv 2023.10] paper
- When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs [arxiv 2024.6] paper
- LLMs cannot find reasoning errors, but can correct them given the error location [ACL 2024] paper code
- Self-critiquing models for assisting human evaluators [arxiv 2022.6] paper
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve [arxiv 2024.7] paper
- Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning [arxiv 2024.10] paper
- Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning [arxiv 2024.6] paper code
- GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements [arxiv 2024.2] paper
- Generating Sequences by Learning to Self-Correct [ICLR 2023] paper code
- Training Language Models to Self-Correct via Reinforcement Learning [arxiv 2024.9] paper
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models [NeurIPS 2023] paper code
- Self-Evaluation Guided Beam Search for Reasoning [NeurIPS 2023] paper code
- Reasoning with Language Model is Planning with World Model [EMNLP 2023] paper code
- Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [arxiv 2024.6] paper code
- Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [arxiv 2023.9] paper
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [arxiv 2024.8] paper code
- Interpretable Contrastive Monte Carlo Tree Search Reasoning [arxiv 2024.10] paper code
- ReST-MCTS: LLM Self-Training via Process Reward Guided Tree Search [arxiv 2024.6] pdf code
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [arxiv 2024.5] paper code
- O1 Replication Journey: A Strategic Progress Report -- Part 1 [arxiv 2024.10] paper code
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [arxiv 2024.11] paper code
- o1-Coder: an o1 Replication for Coding [arxiv 2024.12] paper code
- DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging [EMNLP 2024] paper code
- Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs [NeurIPS 2024] paper code
- Generalizing Reward Modeling for Out-of-Distribution Preference Learning [ECML-PKDD 2024] paper code
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [arxiv 2023.12] paper
- Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [ICLR 2024] paper code
- Multimodal Chain-of-Thought Reasoning in Language Models [TMLR 2024] paper code
- Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [NeurIPS 2024] paper
- KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning [AAAI 2024] paper
- Multimodal Reasoning with Multimodal Knowledge Graph [ACL 2024] paper
- Interleaved-Modal Chain-of-Thought [arxiv 2024.11] paper
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [arxiv 2024.11] paper code
- Learning How Hard to Think: Input-Adaptive Allocation of LM Computation [arxiv 2024.10] paper
- Scaling LLM Inference with Optimized Sample Compute Allocation [arxiv 2024.10] paper code
- Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies [EMNLP 2024] paper
- Token-Budget-Aware LLM Reasoning [arxiv 2024.12] paper code
- Compressed Chain of Thought: Efficient Reasoning Through Dense Representations [arxiv 2024.12] paper
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [arxiv 2024.9] paper
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [arxiv 2024.8] paper
- Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [arxiv 2024.10] paper code
- A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models [arxiv 2024.11] paper
- The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [arxiv 2024.11] paper code
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [arxiv 2024.11] paper code
- Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS [arxiv 2024.11] paper
If our survey is helpful to your research, please cite our paper:
```bibtex
@article{ji2025test,
  title={Test-time Computing: from System-1 Thinking to System-2 Thinking},
  author={Ji, Yixin and Li, Juntao and Ye, Hai and Wu, Kaixin and Xu, Jia and Mo, Linjian and Zhang, Min},
  journal={arXiv preprint arXiv:2501.02497},
  year={2025}
}
```