HPCA 2026

Meta Info

Homepage: https://conf.researchr.org/home/hpca-2026

Paper list: https://2026.hpca-conf.org/track/hpca-2026-main-conference#event-overview

Papers

LLM

  • LLM training
    • AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training
      • Zhejiang Lab
  • LLM inference
    • AUM: Unleashing the Efficiency Potential of Shared Processors with Accelerator Units for LLM Serving [Paper]
      • SJTU & Alibaba
    • GyRot: Leveraging Hidden Synergy between Rotation and Fine-grained Group Quantization for Low-bit LLM Inference
      • KAIST
    • ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving
      • SJTU & Huawei Cloud & HKUST
    • Towards Resource-Efficient Serverless LLM Inference with SLINFER [arXiv]
      • SJTU
    • LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration
      • UIUC & Seoul National University & Intel
    • PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System [arXiv]
      • Hanyang University & SK hynix & KAIST
  • Speculative decoding
    • Adaptive Draft Sequence Length: Enhancing Speculative Decoding Throughput on PIM-Enabled Systems
      • HUST
  • Wafer
    • WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip [arXiv]
      • THU
    • TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips [arXiv]
      • THU
    • FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer
      • THU
    • MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference [arXiv]
      • THU
    • HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs
      • William&Mary
    • ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips
      • THU & Shanghai AI Lab
  • Quantization
    • BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache [arXiv]
      • Edinburgh & MSRA
    • AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
      • Institute of Science Tokyo
  • Reasoning
    • The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective [arXiv]
      • KAIST
    • PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
      • KAIST
    • RPU - A Reasoning Processing Unit
      • Harvard
  • RAG
    • VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG [arXiv]
      • GaTech
  • VLM
    • Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models [arXiv] [Code]
      • Duke
  • Video LLM
    • V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval [arXiv]
      • KAIST
  • Misc
    • Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
      • SJTU & Huawei
    • RoMe: Row Granularity Access Memory System for Large Language Models [arXiv]
      • Seoul National University & Meta
    • LEGO: Supporting LLM-enhanced Games with One Gaming GPU
      • SJTU & Tongji University

GPU

  • UVM
    • ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription [Artifact]
      • Yonsei University & DGIST
  • Chiplet
    • COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators
      • NUDT & PKU
    • Deadlock-Free Bridge Module for Inter-Chiplet Communication in Open Chiplet Ecosystem
      • NUDT
    • LRM-GPU: Alleviating Synchronization Overhead for Multi-Chiplet GPU Architecture
      • SYSU
  • Sparsity
    • Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs
      • Hunan University
    • Uni-STC: Unified Sparse Tensor Core
      • CUP-Beijing & NUDT
  • Misc
    • QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs
      • University of Murcia & William&Mary & NVIDIA
    • μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs
      • TJU
    • FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive operators via Inter-Core Connection [arXiv]
      • SJTU

VAR

  • VAR-Turbo: Unlocking the Potential of Visual Autoregressive Models through Dual Redundancy
    • HKUST

Acronyms

  • LLM: Large Language Model
  • VLM: Vision-Language Model
  • RAG: Retrieval-Augmented Generation
  • UVM: Unified Virtual Memory
  • VAR: Visual AutoRegressive modeling