Homepage: https://conf.researchr.org/home/hpca-2026
Paper list: https://2026.hpca-conf.org/track/hpca-2026-main-conference#event-overview
- LLM training
- AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training
- Zhejiang Lab
- LLM inference
- AUM: Unleashing the Efficiency Potential of Shared Processors with Accelerator Units for LLM Serving [Paper]
- SJTU & Alibaba
- GyRot: Leveraging Hidden Synergy between Rotation and Fine-grained Group Quantization for Low-bit LLM Inference
- KAIST
- ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving
- SJTU & Huawei Cloud & HKUST
- Towards Resource-Efficient Serverless LLM Inference with SLINFER [arXiv]
- SJTU
- LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration
- UIUC & Seoul National University & Intel
- PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System [arXiv]
- Hanyang University & SK hynix & KAIST
- Speculative decoding
- Adaptive Draft Sequence Length: Enhancing Speculative Decoding Throughput on PIM-Enabled Systems
- HUST
- Wafer
- WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip [arXiv]
- THU
- TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips [arXiv]
- THU
- FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer
- THU
- MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference [arXiv]
- THU
- HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs
- William & Mary
- ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips
- THU & Shanghai AI Lab
- Quantization
- BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache [arXiv]
- Edinburgh & MSRA
- AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
- Institute of Science Tokyo
- Reasoning
- The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective [arXiv]
- KAIST
- PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
- KAIST
- RPU - A Reasoning Processing Unit
- Harvard
- RAG
- VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG [arXiv]
- GaTech
- VLM
- Video LLM
- V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval [arXiv]
- KAIST
- Misc
- Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
- SJTU & Huawei
- RoMe: Row Granularity Access Memory System for Large Language Models [arXiv]
- Seoul National University & Meta
- LEGO: Supporting LLM-enhanced Games with One Gaming GPU
- SJTU & Tongji University
- UVM
- ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription [Artifact]
- Yonsei University & DGIST
- Chiplet
- COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators
- NUDT & PKU
- Deadlock-Free Bridge Module for Inter-Chiplet Communication in Open Chiplet Ecosystem
- NUDT
- LRM-GPU: Alleviating Synchronization Overhead for Multi-Chiplet GPU Architecture
- SYSU
- Sparsity
- Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs
- Hunan University
- Uni-STC: Unified Sparse Tensor Core
- CUP-Beijing & NUDT
- Misc
- QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs
- University of Murcia & William & Mary & NVIDIA
- μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs
- TJU
- FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive operators via Inter-Core Connection [arXiv]
- SJTU
- VAR-Turbo: Unlocking the Potential of Visual Autoregressive Models through Dual Redundancy
- HKUST
- LLM: Large Language Model
- VLM: Vision-Language Model
- RAG: Retrieval-Augmented Generation
- UVM: Unified Virtual Memory
- VAR: Visual AutoRegressive modeling