Unwrap lines and clean up lists and spacing
The lines were wrapped inconsistently; some things that should have
been lists were not, and some things that should not have been were.
Let's fix that.
traviscross committed Jun 28, 2024

## Motivation

Scientific Computing, High Performance Computing, and Machine Learning share the interesting challenge that they (to different degrees) care about very efficient library and algorithm implementations, but are not always used by people experienced in computer science. Rust is in a nice position here because ownership, lifetimes, and the strong type system can prevent a decent number of bugs. At the same time, strong aliasing information allows very nice performance optimizations in these fields, with performance gains well beyond what you see in the usual C++ vs. Rust performance comparisons. This is the case because these fields often use automatic differentiation and GPU/co-processor offloading, both of which benefit strongly from knowing which pointers or references alias.

### The status quo

Rust has an excellent Python interop story thanks to PyO3. C++ has a weak interop story, and as such I've seen cases where a slower C library is used as the backend for some Python libraries because it was easier to bundle. Fortran is mostly used in legacy places and hardly used for new projects. As a solution, many researchers try to limit themselves to features which are offered by compilers and libraries built on top of Python, like JAX, PyTorch, or more recently Mojo. Rust has a lot of features which make it more suitable than those languages for developing a fast and reliable backend for performance-critical software. However, it lacks a feature which developers have come to expect: *trivial GPU usage*. Almost every language has some way of calling hand-written CUDA/ROCm/SYCL kernels, but the interesting feature of languages like Julia, or of libraries like JAX, is that they allow users to write kernels in a (subset of) their already known language, without having to learn anything new. Minor performance penalties are not that critical in such cases if the alternative is a CPU-only solution, and projects like Rust-CUDA have ended up unmaintained because they are too much effort to maintain outside of the LLVM or Rust projects.

*Elaborate in more detail about the problem you are trying to solve. This section is making the case for why this particular problem is worth prioritizing with project bandwidth. A strong status quo section will (a) identify the target audience and (b) give specifics about the problems they are facing today. Sometimes it may be useful to start sketching out how you think those problems will be addressed by your change, as well, though it's not necessary.*


### The "shiny future" we are working towards

All three proposed features (batching, autodiff, offloading) can be combined and work nicely together. We have state-of-the-art libraries like faer to cover linear algebra, and we are starting to see more and more libraries in other languages use Rust with these features as their backend. Use cases which don't require interactive exploration are also becoming more popular in pure Rust.

## Design axioms

### Offloading

- We try to provide a safe, simple and opaque offloading interface.
- The "unit" of offloading is a function.
- We try not to expose explicit data movement if ownership gives us enough information (see the sketch after this list).
- Offloaded code might not return the exact same values as code executed on the CPU. We will work with t-(opsem?) to develop clear rules.
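
As a sketch of how ownership could drive implicit data movement (the attribute name and the exact inference rules below are placeholders, not a settled design):

```rust
// Hypothetical: a future `#[offload]` attribute (placeholder name) would mark
// this function for the LLVM offloading backend. From the signature alone,
// data movement can be inferred from the borrow kinds:
// - `x` is a shared borrow: it only needs to be copied *to* the device.
// - `out` is a mutable borrow: it must be copied to the device and back.
// - An owned argument could even be moved to the device with no copy back.
fn scale(x: &[f32], factor: f32, out: &mut [f32]) {
    for (o, v) in out.iter_mut().zip(x) {
        *o = v * factor;
    }
}
```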

### Autodiff

- We try to provide a fast autodiff interface which supports most autodiff features relevant for Scientific Computing.
- The "unit" of autodiff is a function.
- We acknowledge our responsibility, since user-implemented autodiff without compiler knowledge might struggle to cover gaps in our features.
- We do not support differentiating (inline) assembly. Users are expected to write custom derivatives in such cases.
- We might refuse to expose certain features if they are too hard to use correctly and provide little gain (e.g. derivatives with respect to global vars).
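
As a rough sketch of what this interface could look like for users (the attribute shape loosely follows the syntax discussed in the autodiff RFC; the name, the activity annotations, and the generated signature are all still open):

```rust
// Hypothetical sketch: the attribute shape, the activity annotations
// (Duplicated, Active), and the generated signature are not settled.
#[autodiff(d_square, Reverse, Duplicated, Active)]
fn square(x: &f64) -> f64 {
    x * x
}

fn main() {
    let x = 3.0;
    let mut dx = 0.0;
    // The macro would generate d_square: it returns square(x) and
    // accumulates the derivative w.r.t. x, scaled by the seed, into dx.
    let y = d_square(&x, &mut dx, 1.0);
    assert_eq!(y, 9.0);
    assert_eq!(dx, 6.0); // d/dx x^2 = 2x = 6 at x = 3
}
```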

*Add your [design axioms][da] here. Design axioms clarify the constraints and tradeoffs you will use as you do your design work. These are most important for project goals where the route to the solution has significant ambiguity (e.g., designing a language feature or an API), as they communicate to your reader how you plan to approach the problem. If this goal is more aimed at implementation, then design axioms are less important. [Read more about design axioms][da].*

[da]: ../about/design_axioms.md

**Owner:** ZuseZ4 / Manuel S. Drehwald

Manuel S. Drehwald is working on this 5 days/wk, sponsored by LLNL and the University of Toronto (UofT). He has a background in HPC and has worked on a Rust compiler fork, as well as on an LLVM-based autodiff tool, for the last 3 years during his undergrad. He is now in a research-based Master's program. Supervision and discussion on the LLVM side happens with Johannes Doerfert and Tom Scogland.

Resources: The domain and CI for the autodiff work are provided by MIT and might be moved to the LLVM org later this year. Hardware for benchmarks is provided by LLNL and UofT. CI for the offloading work will be provided by LLNL or LLVM (?, see below).

### Support needed from the project


### Outputs

- An `#[Offload]` rustc-builtin-macro which makes a function definition known to the LLVM offloading backend.
- A bikeshed `offload!([GPU1, GPU2, TPU1], foo(x, y, z));` macro which will execute the function `foo` on the specified devices.
- An `#[Autodiff]` rustc-builtin-macro which differentiates a given function.
- A `#[Batching]` rustc-builtin-macro which fuses N function calls into one call, enabling better vectorization.
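
As a hypothetical sketch of how the first two outputs might be used together (every name here — `#[Offload]`, `offload!`, `GPU1` — is a bikeshed placeholder from the list above):

```rust
// Hypothetical surface syntax; none of this is implemented yet.
#[Offload] // makes `saxpy` known to the LLVM offloading backend
fn saxpy(a: f32, x: &[f32], y: &mut [f32]) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi += a * xi;
    }
}

fn main() {
    let x = vec![1.0_f32; 1 << 20];
    let mut y = vec![2.0_f32; 1 << 20];
    // Execute saxpy on the listed device; data movement would be inferred
    // from the borrows (x copied in, y copied in and back out).
    offload!([GPU1], saxpy(2.0, &x, &mut y));
}
```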

### Milestones

## Frequently asked questions

### Why do you implement these features only on the LLVM Backend?
Performance-wise, we have LLVM and GCC as performant backends. Modularity-wise, we have LLVM and especially Cranelift being nice to modify. It thus seems reasonable that LLVM is the first backend to gain support for new features in this field. The offloading support in particular should be supportable by other compiler backends, given pre-existing work like OpenMP offloading and WebGPU.

### Do these changes have to happen in the compiler?
No! Both features could be implemented in user-space if the Rust compiler supported reflection. In that case I could ask the compiler for the optimized backend IR of a given function. I would then need to use either the AD or the offloading abilities of the LLVM library to modify that IR, generating a new function. The user would then be able to call the newly generated function (see the sketch below). This would require some discussion on how we can have crates in the ecosystem that work with various LLVM versions, since crates are usually expected to have an MSRV, but the LLVM (and likewise the GCC/Cranelift) backends will have breaking changes, unlike Rust.
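
As a purely hypothetical sketch of that user-space flow (no such reflection API exists in Rust today; `reflect` and `enzyme` are invented names used only to illustrate the steps):

```rust
// Purely hypothetical: `reflect` and `enzyme` are invented names that
// illustrate the flow described above; none of these APIs exist.
fn square(x: f64) -> f64 {
    x * x
}

fn main() {
    // 1. Ask the compiler for the optimized backend IR of a function.
    let ir = reflect::backend_ir_of(square as fn(f64) -> f64);
    // 2. Use the AD (or offloading) abilities of the LLVM library to
    //    transform the IR, generating a new function.
    let d_ir = enzyme::autodiff(ir);
    // 3. Materialize and call the newly generated function.
    let d_square: fn(f64) -> f64 = unsafe { reflect::materialize(d_ir) };
    assert_eq!(d_square(3.0), 6.0); // d/dx x^2 = 2x
}
```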

### Batching?
Batching is offered by all autodiff tools; JAX has an extra command for it, whereas Enzyme (the autodiff backend) combines batching with autodiff. We might want to split these, since both have value on their own. Some libraries also offer Array-of-Structs vs. Struct-of-Arrays features, which are related but often have limited usability or performance when implemented in userspace. To be fair, this is a less mature feature of Enzyme, so I could understand concerns. However, following the autodiff work, this feature can be exposed in very few (100?) lines of code. My main bikeshedding thoughts were about whether we want to pass 3 batched args as `[x1, x2, x3]`, `(x1, x2, x3)`, or `x1, x2, x3` (see the sketch below). It's also a nice feature to get started on once the main autodiff PR has been merged.
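
To make that bikeshed concrete, here is a sketch of the three argument conventions for a scalar function batched with width 3 (the batched function names are invented; the transformation would be done by the proposed `#[Batching]` macro rather than by hand):

```rust
// The scalar function; a #[Batching] macro would fuse N calls into one,
// enabling vectorization across the batch.
fn f(x: f32) -> f32 {
    x * x
}

// Option 1: batched args as an array: [x1, x2, x3]
fn f_batched_array(x: [f32; 3]) -> [f32; 3] {
    x.map(f)
}

// Option 2: batched args as a tuple: (x1, x2, x3)
fn f_batched_tuple(x: (f32, f32, f32)) -> (f32, f32, f32) {
    (f(x.0), f(x.1), f(x.2))
}

// Option 3: batched args spread into separate parameters: x1, x2, x3
fn f_batched_spread(x1: f32, x2: f32, x3: f32) -> (f32, f32, f32) {
    (f(x1), f(x2), f(x3))
}
```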

### Writing a GPU backend in 6 months sounds tough...

True. But similar to the autodiff work, I'm exposing something that already exists in the backend. I just don't think that Rust, Julia, C++, Carbon, Fortran, Chapel, Haskell, Bend, Python, ... should all write their own GPU or autodiff backends. Most of these already share compiler optimizations through LLVM or GCC, so let's share this too. Of course, we should still push to use our Rust-specific magic.

### Rust Specific Magic?
TODO

### How about Safety?
I want all these features to be safe by default, and I am happy to not expose some features if the gain is too small for the safety risk. As an example, Enzyme can compute the derivative with respect to a global. Too niche, and discouraged (and unsafe) for Rust. `¯\_(ツ)_/¯`

How to parallelize your 3 nested for loops efficiently has been researched for decades. Lately there has also been more work on how to translate different parallelism types efficiently, e.g. from GPUs to CPUs, or now maybe some rayon parallelism to GPUs? I am therefore not particularly worried about correctness.
