Homepage: https://iscaconf.org/isca2025/
Paper list: https://www.iscaconf.org/isca2025/program/
- LLM Training
- LLM Inference
- H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
  - Best Paper Nominee
- SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
- LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
- AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
- LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
  - UIUC
- Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
- WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
  - Industry Track
- Retrieval-Augmented Generation (RAG)
- HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
  - HUST
- Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale
- RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
- Quantization & Compression
- Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
- Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression
- Performance Modeling
- AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs
- TRACI: Network Acceleration of Input-Dynamic Communication for Large-Scale Deep Learning Recommendation Model
- GPU Management
- Forest: Access-aware GPU UVM Management
- NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
- UGPU: Dynamically Constructing Unbalanced GPUs for Enhanced Resource Efficiency
- Serverless Computing
- Single-Address-Space FaaS with Jord
- Microservices
- HardHarvest: Hardware-Supported Core Harvesting for Microservices
- Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving
- Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
  - UIUC
- DCPerf: An Open-Source, Battle-Tested Performance Benchmark Suite for Datacenter Workloads
  - Industry Track
- Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
  - Industry Track