ISCA 2025

Meta Info

Homepage: https://iscaconf.org/isca2025/

Paper list: https://www.iscaconf.org/isca2025/program/

Papers

Large Language Models (LLMs)

  • LLM Training
    • Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models [Code]
      • HKUST-GZ
    • MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training [Paper]
      • UIUC
    • Scaling Llama 3 Training with Efficient Parallelism Strategies
      • Industry Track
  • LLM Inference
    • H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
      • Best Paper Nominee
    • SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
    • LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
    • AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
    • LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
      • UIUC
    • Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window
    • WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling
    • Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
      • Industry Track
  • Retrieval-Augmented Generation (RAG)
    • HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
      • HUST
    • Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale
    • RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
  • Quantization & Compression
    • Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
    • Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression
  • Performance Modeling
    • AMALI: An Analytical Model for Accurately Modeling LLM Inference on Modern GPUs

Deep Learning Recommendation Models (DLRMs)

  • TRACI: Network Acceleration of Input-Dynamic Communication for Large-Scale Deep Learning Recommendation Model

Resource Management

  • GPU Management
    • Forest: Access-aware GPU UVM Management
    • NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
    • UGPU: Dynamically Constructing Unbalanced GPUs for Enhanced Resource Efficiency
  • Serverless Computing
    • Single-Address-Space FaaS with Jord
  • Microservices
    • HardHarvest: Hardware-Supported Core Harvesting for Microservices

Performance Analysis & Benchmark

  • Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving
  • Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
    • UIUC
  • DCPerf: An Open-Source, Battle-Tested Performance Benchmark Suite for Datacenter Workloads
    • Industry Track

AI Chip

  • Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
    • Industry Track