Paper Reading

Math

  • Categorical Foundations for CuTe Layouts [Paper]

GPU Microarchitecture

  • Understanding Latency Hiding on GPUs [Paper]

Deep Learning Compiler

  • The Deep Learning Compiler: A Comprehensive Survey: [Paper] [Note]
  • Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks [OSDI'20]: [Paper] [Note]
  • ROLLER: Fast and Efficient Tensor Compilation for Deep Learning [OSDI'22]: [Paper] [Note]
  • BOLT: Bridging the Gap Between Auto-Tuners and Hardware-Native Performance [MLSys'22]: [Paper] [Note]
  • AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures [ASPLOS'22]: [Paper] [Note]
  • Welder: Scheduling Deep Learning Memory Access via Tile-graph [OSDI'23]: [Paper] [Note]
  • Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators [OSDI'23]: [Paper]
  • Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning [OSDI'23]: [Paper] [Note]
  • Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion [HPCA'23]: [Paper] [Note]
  • MLIR: Scaling Compiler Infrastructure for Domain Specific Computation [CGO'21]: [Paper] [Note]
  • Graphene: An IR for Optimized Tensor Computations on GPUs [ASPLOS'23]: [Paper] [Note]
  • TIRAMISU: A Polyhedral Compiler for Expressing Fast and Portable Code [CGO'19]: [Paper] [Note]
  • AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction [ISCA'22]: [Paper] [Note]
  • Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor [SOSP'24]: [Paper]
  • ThunderKittens: Simple, Fast, and Adorable AI Kernels [Paper]
  • Mirage: A Multi-Level Superoptimizer for Tensor Programs [OSDI'25]: [Paper]
  • PipeThreader: Software-Defined Pipelining for Efficient DNN Execution [OSDI'25]: [Paper]
  • KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads [OSDI'25]: [Paper]
  • TileLang: A Composable Tiled Programming Model for AI Systems [Paper]

LLM Inference

  • WaferLLM: Large Language Model Inference at Wafer Scale [OSDI'25]: [Paper]

Long Context Inference

  • Training-Free Long-Context Scaling of Large Language Models [ICML'24]: [Paper] [Note]
  • Efficient Streaming Language Models with Attention Sinks [ICLR'24]: [Paper]
  • Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference [ICML'24]: [Paper]
  • DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads [ICLR'25]: [Paper]
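
As a pointer into the attention-sinks entry above: the idea amounts to a fixed KV-cache retention policy, always keeping the first few tokens (the sinks) plus a sliding window of the most recent tokens. A minimal sketch, with illustrative parameter defaults that are not taken from the paper's code:

```python
def streaming_kv_indices(seq_len: int, n_sink: int = 4, window: int = 1024) -> list[int]:
    """Token positions retained in the KV cache under an attention-sink
    policy: the first n_sink positions plus the window most recent ones."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # still within budget, keep everything
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))
```

For example, `streaming_kv_indices(10_000)` keeps positions 0-3 and 8976-9999, so the cache stays constant-size however long the stream grows.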

LLM Serving

  • SGLang: Efficient Execution of Structured Language Model Programs [Paper]
  • FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Paper]
  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving [OSDI'24]: [Paper]
  • LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism [SOSP'24]: [Paper]
  • Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot [FAST'25]: [Paper]
  • NanoFlow: Towards Optimal Large Language Model Serving Throughput [OSDI'25]: [Paper]
  • A Survey of LLM Inference Systems [Paper] [Note]

MegaKernel

  • Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B [Paper]
  • Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference [Paper]

LLM Training

Distributed Training

  • LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism [Paper]

Compute-Communication Overlap

  • Flux: Fast Software-based Communication Overlap on GPUs through Kernel Fusion [Paper]
  • Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [Paper]
  • Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning [ASPLOS'24 Best Paper]: [Paper]
  • TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives [MLSys'25]: [Paper]
  • Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler [Paper]
  • FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation [EuroSys'25]: [Paper]
  • TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference [Paper]
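
The designs above differ in granularity and mechanism, but most refine one baseline pattern: split the output into chunks and launch each chunk's collective asynchronously while the next chunk computes. A hedged PyTorch sketch of that baseline (not any listed paper's API; `chunked_matmul_allreduce` is an illustrative name, and `torch.distributed` is assumed to be initialized with a backend such as NCCL that runs communication on its own stream):

```python
import torch
import torch.distributed as dist

def chunked_matmul_allreduce(a_chunks, b):
    """Overlap-pattern sketch: the all-reduce of chunk i runs in the
    background while chunk i+1 is being computed.
    a_chunks: list of (m_i, k) tensors; b: (k, n) partial-sum shard,
    so per-rank matmul results are summed across ranks."""
    outputs, handles = [], []
    for a in a_chunks:
        c = a @ b                                          # compute current chunk
        handles.append(dist.all_reduce(c, async_op=True))  # communicate in background
        outputs.append(c)
    for h in handles:                                      # drain in-flight collectives
        h.wait()
    return torch.cat(outputs, dim=0)                       # reassemble the (m, n) output
```

How much latency this actually hides depends on chunk sizing and scheduling, which is roughly the design space that Centauri's communication partitioning, TileLink's tile-centric primitives, and FlashOverlap's lightweight design each explore.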

LLM Attention

  • A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention [Paper]

LLM System for Mobile

  • On-Device Training Under 256KB Memory [NeurIPS'22]: [Paper]
  • PockEngine: Sparse and Efficient Fine-tuning in a Pocket [MICRO'23]: [Paper] [Note]

Deep Learning

Attention with Variants

  • Attention Is All You Need [NIPS'17]: [Paper] [Note]
  • Big Bird: Transformers for Longer Sequences [NeurIPS'20]: [Paper] [Note]
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [NeurIPS'22]: [Paper] [Note]
  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning [arXiv]: [Paper] [Note]
  • Flash-Decoding for long-context inference [Blog]: [Paper] [Note]
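
All five entries above compute (or approximate) the same kernel, the scaled dot-product attention of "Attention Is All You Need": softmax(QK^T / sqrt(d)) V. A minimal NumPy reference sketch (function name and shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Reference attention: softmax(Q K^T / sqrt(d)) V.
    q: (seq_q, d), k: (seq_k, d), v: (seq_k, d_v)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (seq_q, seq_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (seq_q, d_v)
```

FlashAttention produces this exact result tile by tile with an online softmax, so the full (seq_q, seq_k) score matrix never materializes in off-chip memory; FlashAttention-2 and Flash-Decoding rework how that tiled computation is partitioned across warps and along the sequence.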

New Architecture for LLM

  • Gated Linear Attention Transformers with Hardware-Efficient Training [arXiv]: [Paper] [Note]

Compiler

  • Honeycomb: Secure and Efficient GPU Executions via Static Validation [OSDI'23]: [Paper] [Note]
  • HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis [ASPLOS'24]: [Paper] [Note]

OS

  • RedLeaf: Isolation and Communication in a Safe Operating System [OSDI'20]: [Paper] [Note]
  • Theseus: an Experiment in Operating System Structure and State Management [OSDI'20]: [Paper]
  • Unikraft: Fast, Specialized Unikernels the Easy Way [EuroSys'21]: [Paper] [Note]
  • The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems [SOSP'21]: [Paper] [Note]

Hypervisor

  • HyperBench: A Benchmark Suite for Virtualization Capabilities: [Paper] [Note]
  • DuVisor: a User-level Hypervisor Through Delegated Virtualization [arXiv'22]: [Paper]
  • AvA: Accelerated Virtualization of Accelerators [ASPLOS'20]: [Paper]
  • Security and Performance in the Delegated User-level Virtualization [OSDI'23]: [Paper] [Note]
  • System Virtualization for Neural Processing Units [HotOS'23]: [Paper]
  • Nephele: Extending Virtualization Environments for Cloning Unikernel-based VMs [EuroSys'23]: [Paper] [Note]
  • Honeycomb: Secure and Efficient GPU Executions via Static Validation [OSDI'23]: [Paper] [Note]

RISC-V

  • A First Look at RISC-V Virtualization from an Embedded Systems Perspective [TC'21]: [Paper]
  • CVA6 RISC-V Virtualization: Architecture, Microarchitecture, and Design Space Exploration [arXiv'23]: [Paper]

About

My Paper Reading Lists and Notes.
