A few points left to do for the Julia implementation:

- [x] GPU kernels
- [ ] Better generic kernels for single inputs, using the optimized polynomial expressions for small L
- [ ] ChainRules.jl compatibility
- [ ] Careful performance comparison to see where Julia is slower than C++ (or vice versa)
- [ ] Multi-threaded kernels for larger input batches