ISCA 2025

Meta Info

Homepage: https://iscaconf.org/isca2025/

Paper list: https://www.iscaconf.org/isca2025/program/

Papers

Large Language Models (LLMs)

LLM Training
- Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models [Code]
  - HKUST-GZ
- MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training [Paper]
  - UIUC
- Scaling Llama 3 Training with Efficient Parallelism Strategies
  - Industry Track
LLM Inference
- H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
  - Best Paper Nominee
- SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
- LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
- AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
- LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
  - UIUC
- Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
- WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
  - Industry Track
Retrieval-Augmented Generation (RAG)
- HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
  - HUST
- Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale
- RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
Quantization & Compression
- Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
- Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression
Performance Modeling
- AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs

Deep Learning Recommendation Models (DLRMs)

- TRACI: Network Acceleration of Input-Dynamic Communication for Large-Scale Deep Learning Recommendation Model

Resource Management

GPU Management
- Forest: Access-aware GPU UVM Management
- NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
- UGPU: Dynamically Constructing Unbalanced GPUs for Enhanced Resource Efficiency
Serverless Computing
- Single-Address-Space FaaS with Jord
Microservices
- HardHarvest: Hardware-Supported Core Harvesting for Microservices

Performance Analysis & Benchmark

- Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving
- Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
  - UIUC
- DCPerf: An Open-Source, Battle-Tested Performance Benchmark Suite for Datacenter Workloads
  - Industry Track

AI Chip

- Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
  - Industry Track
