Welcome to the Awesome-Image-Generation-with-Thinking repository! This repository is a curated collection of research on empowering models to think during image generation. We survey current works and group them into three approaches: explicit reflection, reinforcement learning, and unified multimodal models.
- [2025-06] We created this repository to maintain a paper list on Awesome-Image-Generation-With-Thinking. Contributions are welcome!
- Survey
- Explicit Reflection
- Reinforcement Learning
- Unified LMMs
- Benchmarks
Reinforcement learning has proven to be a crucial step in enhancing reasoning capabilities. Here, we summarize methods that bring reinforcement learning algorithms such as GRPO into the image generation process (a minimal GRPO-style sketch follows the list below).
- Can we generate images with CoT? Let's verify and reinforce image generation step by step (Jan., 2025)
- SimpleAR: Pushing the frontier of autoregressive visual generation through pretraining, SFT, and RL (Apr., 2025)
- T2I-R1: Reinforcing image generation with collaborative semantic-level and token-level CoT (May, 2025)
- Flow-GRPO: Training flow matching models via online RL (May, 2025)
- DanceGRPO: Unleashing GRPO on visual generation (May, 2025)
- GoT-R1: Unleashing reasoning capability of MLLM for visual generation with reinforcement learning (May, 2025)
- Co-Reinforcement learning for unified multimodal understanding and generation (May, 2025)
- ReasonGen-R1: CoT for autoregressive image generation model through SFT and RL (May, 2025)
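For readers new to GRPO, the snippet below is a minimal, self-contained sketch (our own illustration, not the implementation of any listed paper) of the group-relative advantage computation these methods build on: several images are sampled for one prompt, each is scored by a reward model, and rewards are normalized within the group so that no separate critic network is required. The function name and the example rewards are hypothetical.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for G images sampled from the same prompt.

    `rewards` has shape (G,) and holds one scalar reward per sampled image,
    e.g. from a text-image alignment scorer. Each image's advantage is its
    reward relative to its own group, so no value/critic network is needed.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Hypothetical usage: score 8 images generated for one prompt; the resulting
# advantages would then weight the policy's log-probabilities during training.
rewards = torch.tensor([0.72, 0.41, 0.88, 0.55, 0.60, 0.93, 0.47, 0.66])
advantages = grpo_advantages(rewards)
print(advantages)
```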
Reflection is an essential step in the thinking process. Explicit reflection, which leverages modalities such as text, object coordinates, and images paired with editing instructions, is a typical approach (an illustrative reflection loop is sketched after the list below).
- Visual programming: Compositional visual reasoning without training (CVPR, 2023)
- ViperGPT: Visual inference via Python execution for reasoning (ICCV, 2023)
- From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning (Apr., 2025)
- ImageGen-CoT: Enhancing text-to-image in-context learning with chain-of-thought reasoning (Jan., 2025)
- GoT: Unleashing reasoning capability of multimodal large language model for visual generation and editing (Mar., 2025)
- Visual planning: Let's think only with images (Mar., 2025)
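To make the paradigm concrete, here is an illustrative generate-critique-regenerate loop with textual reflection. It is not taken from any specific paper above; the generator, the VLM critic, the "ok" stopping signal, and all function names are our own placeholders, stubbed out so the loop runs as written.

```python
def generate_image(prompt: str) -> str:
    # Placeholder for any text-to-image generator; returns a tag so the
    # loop is runnable without a real model.
    return f"<image generated from: {prompt!r}>"

def critique_image(image: str, prompt: str) -> str:
    # Placeholder for a VLM critic that inspects the image and returns
    # textual feedback (missing objects, wrong attributes, suggested edits),
    # or "ok" when it is satisfied. This stub always passes.
    return "ok"

def reflect_and_regenerate(prompt: str, max_rounds: int = 3) -> str:
    """Generate, critique, and regenerate with the critique folded back into
    the prompt, i.e. an explicit text-based reflection loop."""
    image = generate_image(prompt)
    for _ in range(max_rounds):
        feedback = critique_image(image, prompt)
        if feedback.strip().lower() == "ok":
            break
        image = generate_image(f"{prompt}\nRevision notes: {feedback}")
    return image

print(reflect_and_regenerate("a red cube on top of a blue sphere"))
```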
Unified LMMs, which couple multimodal understanding and generation in a single model, naturally offer strong text-to-image controllability, so we collect relevant works here.
- Multi-modal generation via cross-modal in-context learning (May, 2024)
- Emu: Generative pretraining in multimodality (ICLR, 2024)
- DreamLLM: Synergistic multimodal comprehension and creation (ICLR, 2024)
- Making LLaMA see and draw with SEED tokenizer (ICLR, 2024)
- MiniGPT-5: Interleaved vision-and-language generation via generative vokens (Mar., 2024)
- Generative multimodal models are in-context learners (CVPR, 2024)
- Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action (CVPR, 2024)
- SEED-X: Multimodal models with unified multi-granularity comprehension and generation (May, 2025)
- Chameleon: Mixed-modal early-fusion foundation models (May, 2025)
- Transfusion: Predict the next token and diffuse images with one multi-modal model (Aug., 2024)
- Show-o: One single transformer to unify multimodal understanding and generation (ICLR, 2025)
- VILA-U: A unified foundation model integrating visual understanding and generation (Mar., 2025)
- Emu3: Next-token prediction is all you need (Sep., 2024)
- Janus: Decoupling visual encoding for unified multimodal understanding and generation (Oct., 2024)
- JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation (Mar., 2025)
- TokenFlow: Unified image tokenizer for multimodal understanding and generation (CVPR, 2025)
- MetaMorph: Multimodal understanding and generation via instruction tuning (Dec., 2024)
- LMFusion: Adapting pretrained language models for multimodal generation (Feb., 2025)
- Janus-Pro: Unified multimodal understanding and generation with data and model scaling (Jan., 2025)
- MINT: Multi-modal chain of thought in unified generative models for enhanced image generation (Mar., 2025)
- Transfer between modalities with metaqueries (Apr., 2025)
- BLIP3-o: A family of fully open unified multimodal models - architecture, training and dataset (May, 2025)
- Emerging properties in unified multimodal pretraining (May, 2025)
- Thinking with generated images (May, 2025)
- Show-o2: Improved native unified multimodal models (Jun., 2025)
- ShareGPT-4o-Image: Aligning multimodal models with GPT-4o-level image generation (Jun., 2025)
- OmniGen2: Exploration to advanced multimodal generation (Jun., 2025)
- Ovis-U1: Unified understanding, generation, and editing (Jun., 2025)
Essential resources for understanding the broader landscape and evaluating progress in visual reasoning.
- ELLA: Equip diffusion models with LLM for enhanced semantic alignment (Mar., 2024)
- T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation (NeurIPS, 2023)
- GenEval: An object-focused framework for evaluating text-to-image alignment (NeurIPS, 2023)
- Commonsense-T2I challenge: Can text-to-image generation models understand commonsense? (COLM, 2024)
- WISE: A world knowledge-informed semantic evaluation for text-to-image generation (May, 2025)
- TIIF-Bench: How does your T2I model follow your instructions? (Jun., 2025)
- OneIG-Bench: Omni-dimensional nuanced evaluation for image generation (Jun., 2025)