diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md new file mode 100644 index 00000000..12790d1f --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -0,0 +1,44 @@ +--- +name: Bug Report +about: Report a bug to help us improve +title: "[Bug] " +labels: bug +assignees: '' +--- + +## Description + +A clear and concise description of the bug. + +## Steps to Reproduce + +Minimal code to reproduce the issue: + +```python +# Your reproduction code here +``` + +## Expected Behavior + +What you expected to happen. + +## Actual Behavior + +What actually happened. + +## Environment + +- Python version: +- JAX version: +- Hardware: CPU / GPU (model) / TPU (version) +- OS: + +## Logs / Error Messages + +``` +Paste relevant logs or error messages here. +``` + +## Additional Context + +Any other context about the problem. diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md new file mode 100644 index 00000000..ae4040cd --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -0,0 +1,27 @@ +--- +name: Feature Request +about: Suggest a new feature or enhancement +title: "[Feature] " +labels: enhancement +assignees: '' +--- + +## Motivation + +Why is this feature needed? What problem does it solve? + +## Expected Behavior + +Describe the desired functionality. + +## Proposed Approach + +(Optional) How might this be implemented? + +## Willingness to Contribute + +- [ ] I am willing to submit a PR for this feature. + +## Additional Context + +Any other context, references, or related issues. diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md new file mode 100644 index 00000000..8bd24ce8 --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -0,0 +1,31 @@ +## Description + +What does this PR do and why? + +## Related Issue + +Closes # + +## Change Type + +- [ ] `feat` — New feature +- [ ] `fix` — Bug fix +- [ ] `refactor` — Code refactoring +- [ ] `docs` — Documentation +- [ ] `ci` — CI/CD changes +- [ ] `test` — Tests +- [ ] `perf` — Performance improvement + +## Checklist + +- [ ] Code passes `uv run ruff check src/ tests/` and `uv run ruff format src/ tests/` +- [ ] New/modified public APIs have complete docstrings (tensor shapes, dimension semantics, business logic) +- [ ] Public functions have input assertions (`assert` or `assert_shape_or_none`) +- [ ] Tests added at the appropriate layer (`tests/ops/`, `tests/modules/`, `tests/layers/`, or `tests/ref/`) +- [ ] If `tops/cpu/` is modified, core developers have been notified and PR is labeled `cpu-ref` + +## Test Results + +``` +Paste relevant test output here. +``` diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 00000000..e60710b3 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,278 @@ +# Contributing to tops + +Thank you for your interest in contributing to tops! This project implements high-performance JAX/Pallas TPU and GPU kernels for modern neural network architectures. + +For a detailed understanding of the system architecture, layer design, and coding standards, please read [ARCHITECTURE.md](ARCHITECTURE.md). + +## Getting Started + +### Prerequisites + +- Python >= 3.12 +- [uv](https://docs.astral.sh/uv/) package manager + +### Setup + +1. Clone the repository: + +```bash +git clone https://github.com/primatrix/pallas-kernel.git +cd pallas-kernel +``` + +2. Install dependencies: + +```bash +# Base install (CPU only) +uv sync + +# With GPU support (CUDA 12) +uv sync --extra gpu + +# With TPU support +uv sync --extra tpu +``` + +3. Install pre-commit hooks: + +```bash +pre-commit install +``` + +4. Run tests to verify your setup: + +```bash +uv run pytest tests/ -v +``` + +## Development Workflow + +### Branching + +1. Create your own development branch in this repository (e.g., `dev//my-feature`, `fix/issue-123`). +2. Make your changes and push to the branch. +3. CI will automatically run checks to ensure code passes lint, formatting, and tests. +4. Once CI passes, open a Pull Request against the `main` branch for code review. + +### Commit Convention + +We follow the [Conventional Commits](https://www.conventionalcommits.org/) format: + +``` +[optional scope]: +``` + +**Types:** + +| Type | Usage | +|------|-------| +| `feat` | New feature | +| `fix` | Bug fix | +| `refactor` | Code refactoring (no behavior change) | +| `docs` | Documentation only | +| `ci` | CI/CD changes | +| `test` | Adding or updating tests | +| `perf` | Performance improvement | + +**Scope** is optional and indicates the affected module: `feat(gla):`, `fix(pallas):`, etc. + +**Examples from project history:** + +- `ci: fix TPU job premature failure due to log disconnect` +- `fix(gla): relax g_gamma non-positive assertion` +- `feat: fused chunk simple gla vjp` + +### Pre-Submit Checks + +Before submitting a PR, ensure the following pass locally: + +```bash +# Lint +uv run ruff check src/ tests/ + +# Format +uv run ruff format src/ tests/ + +# Tests +uv run pytest tests/ -v +``` + +Pre-commit hooks will run `ruff check --fix`, `ruff format`, `trailing-whitespace`, and `check-added-large-files` automatically on each commit. + +## Code Review Process + +### General Changes + +Features, bug fixes, documentation, CI, and tests require **1 core maintainer approval** to merge. + +### Reference Implementation Changes + +Changes to code under `tops/cpu/` (the correctness baseline for all GPU/TPU kernel validation) require review by **all core developers**. This review must be conducted via: + +- Video conference discussion, OR +- Mailing list review + +PRs modifying `tops/cpu/` must be labeled with `cpu-ref` to notify all core developers. + +**Rationale:** `tops/cpu/` serves as the Golden reference for all kernel correctness validation. Changes here impact the entire test and validation system. + +### Review Process + +A complete PR review follows these steps: + +1. **Author self-check**: Ensure all CI checks pass (lint, format, tests). PR description clearly explains what changed and why. +2. **Reviewer examination**: + - **Correctness**: Is the logic correct? Are edge cases covered? + - **Coding standards**: Do docstrings, assertions, and naming follow project standards? + - **Test coverage**: Are tests added at the appropriate layer with sufficient coverage? + - **Performance impact**: Does this introduce unnecessary performance regressions? (For kernel PRs, check roofline analysis results.) +3. **Iterate**: After reviewer feedback, the author addresses comments and pushes fixes until all comments are resolved. +4. **Merge**: Once the required number of approvals is obtained, the reviewer or author merges into `main`. + +## Coding Standards + +Key rules (see [ARCHITECTURE.md Section 2](ARCHITECTURE.md#2-agent-readability) for full details): + +- **Docstrings**: All public functions must have comprehensive docstrings that explicitly describe tensor shapes, dimension semantics, and business logic. +- **Input assertions**: All public functions must enforce strict shape/type constraints using `assert` or `assert_shape_or_none` from `tops.utils`. +- **No-crash guarantee**: Kernels must never crash after passing entry-point assertions. + +## Testing Guidelines + +The project uses a layered testing structure (see [ARCHITECTURE.md Section 3](ARCHITECTURE.md#3-testing--validation-boundaries) for full details): + +| Directory | Purpose | +|-----------|---------| +| `tests/ops/` | Kernel correctness tests (GPU/TPU vs CPU reference) | +| `tests/modules/` | Module unit tests | +| `tests/layers/` | Layer integration tests | +| `tests/ref/` | Reference implementation comparison tests | + +Key requirements: + +- New kernels must have CPU reference comparison tests. +- Error bound constraint: TPU error must not exceed GPU error ($\epsilon_{\text{TPU}} \leq \epsilon_{\text{GPU}}$). +- Modifications at a specific layer must be verified by tests at that same layer. + +## Kernel Design Requirements + +Every new or refactored Kernel **must** have a design document in `docs/design/` covering the following: + +### 1. Algorithm and Code Logic + +- Core mathematical formulas (e.g., recurrence relations, attention computations). +- Implementation strategy: why this approach was chosen and trade-offs with alternatives. + +### 2. Data Flow and Tiling Logic + +- Tiling / Blocking strategy: rationale for block sizes, how they map to hardware (TPU VMEM / GPU SMEM). +- Data movement between HBM and VMEM/SMEM (DMA pipeline, double buffering, etc.). +- Grid dimension partitioning: which dimensions are parallel vs. arbitrary, and why. + +### 3. Performance Target and Roofline Analysis + +The performance target for all Kernels is **80% of hardware theoretical peak**: + +- If the Kernel is **memory bandwidth-bound**: target is 80% of hardware memory bandwidth peak. +- If the Kernel is **compute-bound**: target is 80% of hardware compute peak (FLOPS). + +The design document must include: + +- **Roofline analysis**: arithmetic intensity of the Kernel and whether it is bandwidth-bound or compute-bound. +- **Local benchmark results**: measured throughput on target hardware (e.g., TFLOPS, GB/s). +- **Trace viewer analysis**: include screenshots or links from a trace viewer (e.g., Perfetto / TensorBoard Profiler) showing the Kernel's actual execution timeline — compute, memory transfer, and synchronization phases with their time distribution — to verify pipeline effectiveness and identify bubbles or unnecessary stalls. +- **Whether the 80% target is met**. If not, the reason must be clearly stated: + - **Hardware limitation**: unsupported instructions, memory alignment constraints, known hardware bottlenecks, etc. + - **Code logic limitation**: tiling strategy constraints, pipeline bubbles, synchronization overhead, etc., along with future optimization directions. + +### 4. Precision and Error Analysis + +- **Error metrics**: error between Kernel output and `tops/cpu/` high-precision reference (atol, rtol, max ULP). +- **Gradient error** (if applicable): backward pass error metrics. +- **Test methodology**: input distributions used, shape configurations, hardware tested on. +- **Error bound constraint**: TPU error must not exceed GPU error ($\epsilon_{\text{TPU}} \leq \epsilon_{\text{GPU}}$). + +## Adding New Kernels + +1. Write a design document in `docs/design/` first. +2. Implement CPU reference in `tops/cpu/ops//`. +3. Submit CPU reference for **all-core-developer review** (see [Code Review Process](#reference-implementation-changes)). +4. Implement GPU/TPU kernel in `tops/ops//`. +5. Add comparison tests in `tests/ops//`. +6. Export public API via `tops/ops/__init__.py`. +7. Ensure docstrings and assertions meet coding standards. +8. Submit non-`tops/cpu/` code (steps 4-7), which requires **1 core maintainer approval** to merge. + +> **Note:** Due to the current small number of core developers, the tiered review policy above is enforced by convention only — it is not yet backed by GitHub branch protection rules or CODEOWNERS. Automated enforcement will be added as the team grows. + +## Release Process + +The project uses trunk-based development with release branches (see [ARCHITECTURE.md Section 4](ARCHITECTURE.md#4-release--versioning-policy) for full details): + +1. RC tag created on `main` when ready for delivery (`vX.Y.Z-rc.N`). +2. Release branch `release/vX.Y` pulled from RC tag. +3. Integration testing; fixes cherry-picked back to `main`. +4. Final tag (`vX.Y.Z`) when stable. + +## Profiling Guide + +### Installation + +Install profiling dependencies: + +```bash +uv sync --extra profile +``` + +This installs [xprof](https://github.com/google/xprof) (TensorBoard Profiler) for collecting and analyzing kernel execution traces on TPU/GPU. + +### Collecting Traces + +Use `jax.profiler` to capture traces for a specific code region: + +```python +import jax + +# Start profiler server +jax.profiler.start_server(port=9012) + +# Or manually mark a trace region +jax.profiler.start_trace("/tmp/my_trace") +# ... run your kernel ... +jax.profiler.stop_trace() +``` + +You can also connect to the profiler server remotely via TensorBoard for interactive trace capture. + +### Analyzing Traces + +```bash +# Launch TensorBoard to view traces +tensorboard --logdir /tmp/my_trace +``` + +In the TensorBoard **Profile** tab, focus on: + +- **Trace Viewer**: kernel execution timeline — check time distribution across compute, memory transfer, and synchronization phases. +- **Pipeline effectiveness**: look for bubbles, unnecessary waits, or synchronization overhead. +- **Op-level timing**: identify hotspot operations. + +Trace analysis results must be included in the [Kernel Design document](#3-performance-target-and-roofline-analysis). + +## Hardware Access + +The project provides [SkyPilot](https://skypilot.readthedocs.io/)-based cluster launch scripts: + +```bash +# Launch GPU cluster +./scripts/launch_gpu.sh L4 my-cluster + +# Launch TPU cluster +./scripts/launch_tpu.sh tpu-v6e-1 my-cluster +``` + +See the resource configuration files in `scripts/` (`gpu_resource.sky.yaml`, `tpu_resource.sky.yaml`) for details. You need a configured cloud account and SkyPilot environment before use. + +## Questions? + +If you have questions, open an issue on GitHub or reach out to the maintainers. \ No newline at end of file diff --git a/README.md b/README.md index 8734e0da..7ca5cec2 100644 --- a/README.md +++ b/README.md @@ -55,6 +55,10 @@ from tops.layers.gla import GatedLinearAttention from tops.modules.layernorm import RMSNorm ``` +## Contributing + +We welcome contributions! Please read our [Contributing Guide](CONTRIBUTING.md) ([中文版](CONTRIBUTING.zh.md)) to get started. + ## License Apache License 2.0