- Categorical Foundations for CuTe Layouts [Paper]
- Understanding Latency Hiding on GPUs [Paper]
- The Deep Learning Compiler: A Comprehensive Survey: [Paper] [Note]
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks[OSDI'20]: [Paper] [Note]
- ROLLER: Fast and Efficient Tensor Compilation for Deep Learning[OSDI'22]: [Paper] [Note]
- BOLT: Bridging the Gap Between Auto-Tuners and Hardware-Native Performance[MLSys'22]: [Paper] [Note]
- AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures[ASPLOS'22]: [Paper] [Note]
- Welder: Scheduling Deep Learning Memory Access via Tile-graph[OSDI'23]: [Paper] [Note]
- Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators[OSDI'23]: [Paper]
- Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning[OSDI'23]: [Paper] [Note]
- Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion[HPCA'23]: [Paper] [Note]
- MLIR: Scaling Compiler Infrastructure for Domain Specific Computation[CGO'21]: [Paper] [Note]
- Graphene: An IR for Optimized Tensor Computations on GPUs[ASPLOS'23]: [Paper] [Note]
- TIRAMISU: A Polyhedral Compiler for Expressing Fast and Portable Code[CGO'19]: [Paper] [Note]
- AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction[ISCA'22]: [Paper] [Note]
- Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor[SOSP'24] [Paper]
- ThunderKittens: Simple, Fast, and Adorable AI Kernels [Paper]
- Mirage: A Multi-Level Superoptimizer for Tensor Programs[OSDI'25] [Paper]
- PipeThreader: Software-Defined Pipelining for Efficient DNN Execution[OSDI'25] [Paper]
- KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads[OSDI'25] [Paper]
- TileLang: A Composable Tiled Programming Model for AI Systems [Paper]
- WaferLLM: Large Language Model Inference at Wafer Scale [OSDI'25] [Paper]
- Training-Free Long-Context Scaling of Large Language Models[ICML'24]: [Paper] [Note]
- Efficient Streaming Language Models with Attention Sinks[ICLR'24]: [Paper]
- Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference[ICML'24]: [Paper]
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads[ICLR'25]: [Paper]
- SGLang: Efficient Execution of Structured Language Model Programs [Paper]
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Paper]
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving[OSDI'24] [Paper]
- LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism[SOSP'24] [Paper]
- Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot[FAST'25] [Paper]
- NanoFlow: Towards Optimal Large Language Model Serving Throughput[OSDI'25] [Paper]
- A Survey of LLM Inference Systems [Paper] [Note]
- Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B [Paper]
- Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference [Paper]
- LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism [Paper]
- Flux: Fast Software-based Communication Overlap on GPUs through Kernel Fusion [Paper]
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [Paper]
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning[ASPLOS'24 Best Paper] [Paper]
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives[MLSys'25] [Paper]
- Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler [Paper]
- FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation[EuroSys'25] [Paper]
- TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference [Paper]
- A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention [Paper]
- On-Device Training Under 256KB Memory[NeurIPS'22]: [Paper]
- PockEngine: Sparse and Efficient Fine-tuning in a Pocket[MICRO'23]: [Paper] [Note]
- Attention Is All You Need[NIPS'17]: [Paper] [Note]
- Big Bird: Transformers for Longer Sequences[NIPS'20]: [Paper] [Note]
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness[NIPS'22]: [Paper] [Note]
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning[Arxiv]: [Paper] [Note]
- Flash-Decoding for long-context inference[Blog]: [Paper] [Note]
- Honeycomb: Secure and Efficient GPU Executions via Static Validation[OSDI'23]: [Paper] [Note]
- HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis[ASPLOS'24]: [Paper] [Note]
- RedLeaf: Isolation and Communication in a Safe Operating System[OSDI'20]: [Paper] [Note]
- Theseus: an Experiment in Operating System Structure and State Management[OSDI'20]: [Paper]
- Unikraft: Fast, Specialized Unikernels the Easy Way[EuroSys'21]: [Paper] [Note]
- The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems[SOSP'21]: [Paper] [Note]
- HyperBench: A Benchmark Suite for Virtualization Capabilities: [Paper] [Note]
- DuVisor: a User-level Hypervisor Through Delegated Virtualization[arxiv'22]: [Paper]
- AvA: Accelerated Virtualization of Accelerators[ASPLOS'22]: [Paper]
- Security and Performance in the Delegated User-level Virtualization[OSDI'23]: [Paper] [Note]
- System Virtualization for Neural Processing Units[HotOS'23]: [Paper]
- Nephele: Extending Virtualization Environments for Cloning Unikernel-based VMs[EuroSys'23]: [Paper] [Note]