- Categorical Foundations for CuTe Layouts [Paper]
- Understanding Latency Hiding on GPUs [Paper]
- The Deep Learning Compiler: A Comprehensive Survey: [Paper] [Note]
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks[OSDI'20]: [Paper] [Note]
- ROLLER: Fast and Efficient Tensor Compilation for Deep Learning[OSDI'22]: [Paper] [Note]
- BOLT: Bridging the Gap Between Auto-Tuners and Hardware-Native Performance[MLSys'22]: [Paper] [Note]
- AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures[ASPLOS'22]: [Paper] [Note]
- Welder: Scheduling Deep Learning Memory Access via Tile-graph[OSDI'23]: [Paper] [Note]
- Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators[OSDI'23]: [Paper]
- Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning[OSDI'23]: [Paper] [Note]
- Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion[HPCA'23]: [Paper] [Note]
- MLIR: Scaling Compiler Infrastructure for Domain Specific Computation[CGO'21]: [Paper] [Note]
- Graphene: An IR for Optimized Tensor Computations on GPUs[ASPLOS'23]: [Paper] [Note]
- TIRAMISU: A Polyhedral Compiler for Expressing Fast and Portable Code[CGO'19]: [Paper] [Note]
- AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction[ISCA'22]: [Paper] [Note]
- Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor[SOSP'24] [Paper]
- ThunderKittens: Simple, Fast, and Adorable AI Kernels [Paper]
- Mirage: A Multi-Level Superoptimizer for Tensor Programs[OSDI'25] [Paper]
- PipeThreader: Software-Defined Pipelining for Efficient DNN Execution[OSDI'25] [Paper]
- KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads[OSDI'25] [Paper]
- TileLang: A Composable Tiled Programming Model for AI Systems [Paper]
- WaferLLM: Large Language Model Inference at Wafer Scale [OSDI'25] [Paper]
- Training-Free Long-Context Scaling of Large Language Models[ICML'24]: [Paper] [Note]
- Efficient Streaming Language Models with Attention Sinks[ICLR'24]: [Paper]
- Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference[ICML'24]: [Paper]
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads[ICLR'25]: [Paper]
- SGLang: Efficient Execution of Structured Language Model Programs [Paper]
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Paper]
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving[OSDI'24] [Paper]
- LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism[SOSP'24] [Paper]
- Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot[FAST'25] [Paper]
- NanoFlow: Towards Optimal Large Language Model Serving Throughput[OSDI'25] [Paper]
- A Survey of LLM Inference Systems [Paper] [Note]
- Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B [Paper]
- Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference [Paper]
- LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism [Paper]
- Flux: Fast Software-based Communication Overlap on GPUs through Kernel Fusion [Paper]
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [Paper]
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning[ASPLOS'24 Best Paper] [Paper]
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives[MLSys'25] [Paper]
- Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler [Paper]
- FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation[EuroSys'25] [Paper]
- TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference [Paper]
- A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention [Paper]
- On-Device Training Under 256KB Memory[NeurIPS'22]: [Paper]
- PockEngine: Sparse and Efficient Fine-tuning in a Pocket[MICRO'23]: [Paper] [Note]
- Attention Is All You Need[NIPS'17]: [Paper] [Note]
- Big Bird: Transformers for Longer Sequences[NIPS'20]: [Paper] [Note]
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness[NIPS'22]: [Paper] [Note]
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning[Arxiv]: [Paper] [Note]
- Flash-Decoding for long-context inference[Blog]: [Paper] [Note]
- Honeycomb: Secure and Efficient GPU Executions via Static Validation[OSDI'23]: [Paper] [Note]
- HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis[ASPLOS'24]: [Paper] [Note]
- RedLeaf: Isolation and Communication in a Safe Operating System[OSDI'20]: [Paper] [Note]
- Theseus: an Experiment in Operating System Structure and State Management[OSDI'20]: [Paper]
- Unikraft: Fast, Specialized Unikernels the Easy Way[EuroSys'21]: [Paper] [Note]
- The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems[SOSP'21]: [Paper] [Note]
- HyperBench: A Benchmark Suite for Virtualization Capabilities: [Paper] [Note]
- DuVisor: a User-level Hypervisor Through Delegated Virtualization[arxiv'22]: [Paper]
- AvA: Accelerated Virtualization of Accelerators[ASPLOS'22]: [Paper]
- Security and Performance in the Delegated User-level Virtualization[OSDI'23]: [Paper] [Note]
- System Virtualization for Neural Processing Units[HotOS'23]: [Paper]
- Nephele: Extending Virtualization Environments for Cloning Unikernel-based VMs[EuroSys'23]: [Paper] [Note]