# CHANGELOG

## [0.4.3] - 2022-09-20

- Update all samples to use GPU interop path by default
- Fix for arrays > 2GB in length
- Add support for per-vertex USD mesh colors with warp.render class

## [0.4.2] - 2022-09-07

- Register Warp samples to the sample browser in Kit
- Add NDEBUG flag to release mode kernel builds
- Fix for particle solver node when using a large number of particles
- Fix for broken cameras in Warp sample scenes

## [0.4.1] - 2022-08-30

- Add geometry sampling methods, see `wp.sample_unit_cube()`, `wp.sample_unit_disk()`, etc.
- Add `wp.lower_bound()` for searching sorted arrays
- Add an option for disabling code-gen of backward pass to improve compilation times, see `wp.set_module_options({"enable_backward": False})`, True by default
- Fix for using Warp from Script Editor or when module does not have a `__file__` attribute
- Fix for hot reload of modules containing `wp.func()` definitions
- Fix for debug flags not being set correctly on CUDA when `wp.config.mode == "debug"`, this enables bounds checking on CUDA kernels in debug mode
- Fix for code gen of functions that do not return a value

## [0.4.0] - 2022-08-09

- Fix for FP16 conversions on GPUs without hardware support
- Fix for `runtime = None` errors when reloading the Warp module
- Fix for PTX architecture version when running with older drivers, see `wp.config.ptx_target_arch`
- Fix for USD imports from `__init__.py`, defer them to individual functions that need them
- Fix for robustness issues with sign determination for `wp.mesh_query_point()`
- Fix for `wp.HashGrid` memory leak when creating/destroying grids
- Add CUDA version checks for toolkit and driver
- Add support for cross-module `@wp.struct` references
- Support running even if CUDA initialization failed, use `wp.is_cuda_available()` to check availability
- Statically link with the CUDA runtime library to avoid deployment issues

### Breaking Changes

- Removed `wp.runtime` reference from the top-level module, as it should be considered private

## [0.3.2] - 2022-07-19

- Remove Torch import from `__init__.py`, defer import to `wp.from_torch()`, `wp.to_torch()`

## [0.3.1] - 2022-07-12

- Fix for marching cubes reallocation after initialization
- Add support for closest point between line segment tests, see `wp.closest_point_edge_edge()` builtin
- Add support for per-triangle elasticity coefficients in simulation, see `wp.sim.ModelBuilder.add_cloth_mesh()`
- Add support for specifying default device, see `wp.set_device()`, `wp.get_device()`, `wp.ScopedDevice`
- Add support for multiple GPUs (e.g., `"cuda:0"`, `"cuda:1"`), see `wp.get_cuda_devices()`, `wp.get_cuda_device_count()`, `wp.get_cuda_device()`
- Add support for explicitly targeting the current CUDA context using device alias `"cuda"`
- Add support for using arbitrary external CUDA contexts, see `wp.map_cuda_device()`, `wp.unmap_cuda_device()`
- Add PyTorch device aliasing functions, see `wp.device_from_torch()`, `wp.device_to_torch()`

### Breaking Changes

- A CUDA device is used by default, if available (aligned with `wp.get_preferred_device()`)
- `wp.ScopedCudaGuard` is deprecated, use `wp.ScopedDevice` instead
- `wp.synchronize()` now synchronizes all devices; for finer-grained control, use `wp.synchronize_device()`
- Device alias `"cuda"` now refers to the current CUDA context, rather than a specific device like `"cuda:0"` or `"cuda:1"`

### How does Warp relate to other Python projects for GPU programming, e.g.: Numba, Taichi, cuPy, PyTorch, etc?
-------

Warp is inspired by many of these projects, and is closely related to Numba and Taichi, which both expose kernel programming to Python. These frameworks map to traditional GPU programming models, so many of the high-level concepts are similar; however, there are some differences in functionality and implementation.

Compared to Numba, Warp supports a smaller subset of Python, but offers auto-differentiation of kernel programs, which is useful for machine learning. Compared to Taichi, Warp uses C++/CUDA as an intermediate representation, which makes it convenient to implement and expose low-level routines. In addition, we are building in data structures to support geometry processing (meshes, sparse volumes, point clouds, USD data) as first-class citizens that are not exposed in other runtimes.

Warp does not offer a full tensor-based programming model like PyTorch and JAX, but is designed to work well with these frameworks through data-sharing mechanisms like `__cuda_array_interface__`. For computations that map well to tensors (e.g.: neural-network inference) it makes sense to use these existing tools. For problems with a lot of sparsity, conditional logic, or heterogeneous workloads (like the ones we often find in simulation and graphics), a kernel-based programming model like the one in Warp is often more convenient, since users have control over individual threads.

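As a concrete illustration of that data sharing, here is a minimal sketch of the PyTorch interop path using `wp.from_torch()` / `wp.to_torch()`; it assumes a CUDA-capable PyTorch build, and the tensor and array names are illustrative:

```python
import torch
import warp as wp

wp.init()

# a tensor allocated by PyTorch on the GPU...
t = torch.zeros(1024, dtype=torch.float32, device="cuda:0")

# ...can be wrapped as a Warp array without copying the underlying memory
a = wp.from_torch(t)

# and a Warp array can be exposed back to PyTorch as a tensor
t2 = wp.to_torch(a)
```
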
### Does Warp support all of the Python language?
-------

No, Warp supports a subset of Python that maps well to the GPU. Our goal is to not have any performance cliffs so that users can expect consistently good behavior from kernels that is close to native code. Examples of unsupported concepts that don't map well to the GPU are dynamic types, list comprehensions, exceptions, garbage collection, etc.

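For a sense of what that subset looks like in practice, here is a small, hypothetical kernel; the statically typed arguments and the `wp.tid()` thread index are representative of how Warp kernels are written:

```python
import warp as wp

wp.init()

# kernels use a statically typed subset of Python: annotated arguments,
# fixed-type expressions, and no dynamic features such as lists or exceptions
@wp.kernel
def saxpy(a: float, x: wp.array(dtype=float), y: wp.array(dtype=float)):
    tid = wp.tid()
    y[tid] = a * x[tid] + y[tid]

x = wp.zeros(1024, dtype=float, device="cuda")
y = wp.zeros(1024, dtype=float, device="cuda")

wp.launch(saxpy, dim=1024, inputs=[2.0, x, y], device="cuda")
```
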
### When should I call `wp.synchronize()`?
-------

One of the common sources of confusion for new users is when calls to `wp.synchronize()` are necessary. The answer is "almost never"! Synchronization is quite expensive, and should generally be avoided unless necessary. Warp naturally takes care of synchronization between operations (e.g.: kernel launches, device memory copies).

For example, the following requires no manual synchronization, as the conversion to NumPy will automatically synchronize:

```python
# bring data back to host (and implicitly synchronize)
x = array_z.numpy()
```

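For context, here is a self-contained sketch of the same pattern; the kernel and array names (`scale`, `src`, `dest`) are illustrative rather than part of the snippet above:

```python
import numpy as np
import warp as wp

wp.init()

@wp.kernel
def scale(src: wp.array(dtype=float), dest: wp.array(dtype=float)):
    tid = wp.tid()
    dest[tid] = 2.0 * src[tid]

src = wp.array(np.arange(16, dtype=np.float32), dtype=float, device="cuda")
dest = wp.zeros(16, dtype=float, device="cuda")

# the launch is asynchronous with respect to Python...
wp.launch(scale, dim=16, inputs=[src, dest], device="cuda")

# ...but converting to NumPy synchronizes before copying back,
# so no explicit wp.synchronize() is needed
result = dest.numpy()
```
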
The _only_ case where manual synchronization is needed is when copies are being performed back to CPU asynchronously, e.g.:

```python
# copy data back to cpu from gpu, all copies will happen asynchronously to Python
wp.copy(cpu_array_1, gpu_array_1)
wp.copy(cpu_array_2, gpu_array_2)
wp.copy(cpu_array_3, gpu_array_3)

# ensure that the copies have finished
wp.synchronize()

# return a numpy wrapper around the cpu arrays, note there is no implicit synchronization here
a1 = cpu_array_1.numpy()
a2 = cpu_array_2.numpy()
a3 = cpu_array_3.numpy()
```

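When only one device needs to be waited on, the finer-grained `wp.synchronize_device()` can be used instead of the global `wp.synchronize()`; a brief sketch (the device alias here is illustrative):

```python
import warp as wp

wp.init()

# wait only for outstanding work on one device,
# rather than synchronizing every device as wp.synchronize() does
wp.synchronize_device("cuda:0")
```
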
### What happens when you differentiate a function like `wp.abs(x)`?
-------

Non-smooth functions such as `y=|x|` do not have a single unique gradient at `x=0`; rather, they have what is known as a *subgradient*, which is formally the convex hull of directional derivatives at that point. The way that Warp (and most auto-differentiation frameworks) handles these points is to pick an arbitrary gradient from this set, e.g.: for `wp.abs()`, it will arbitrarily choose the gradient to be 1.0 at the origin. You can find the implementation for these functions in `warp/native/builtin.h`.

Most optimizers (particularly ones that exploit stochasticity) are not sensitive to the choice of which gradient to use from the subgradient, although there are exceptions.

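To see that choice in practice, here is a minimal sketch using Warp's tape-based autodiff (`wp.Tape`); the kernel is illustrative, and the `requires_grad` flag and `tape.gradients` accessor are assumed to be available in your Warp version:

```python
import warp as wp

wp.init()

@wp.kernel
def abs_loss(x: wp.array(dtype=float), loss: wp.array(dtype=float)):
    tid = wp.tid()
    wp.atomic_add(loss, 0, wp.abs(x[tid]))

x = wp.zeros(1, dtype=float, device="cuda", requires_grad=True)
loss = wp.zeros(1, dtype=float, device="cuda", requires_grad=True)

tape = wp.Tape()
with tape:
    wp.launch(abs_loss, dim=1, inputs=[x, loss], device="cuda")

tape.backward(loss=loss)

# at x = 0 the reported gradient is whichever subgradient element
# the builtin picks (1.0 for wp.abs, as described above)
print(tape.gradients[x].numpy())
```
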
### Does Warp support multi-GPU programming?
-------

Yes! Since version `0.4.0` we support allocating, launching, and copying between multiple GPUs in a single process. We follow the naming conventions of PyTorch and use aliases such as `cuda:0`, `cuda:1`, and `cpu` to identify individual devices.

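A minimal sketch of what that looks like; it assumes at least two CUDA devices are present, and the kernel and array names are illustrative:

```python
import warp as wp

wp.init()

@wp.kernel
def fill(a: wp.array(dtype=float), value: float):
    tid = wp.tid()
    a[tid] = value

# allocate on two different GPUs in the same process
a0 = wp.zeros(1024, dtype=float, device="cuda:0")
a1 = wp.zeros(1024, dtype=float, device="cuda:1")

# launch on an explicit device...
wp.launch(fill, dim=1024, inputs=[a0, 1.0], device="cuda:0")

# ...or set a default device for a block of calls
with wp.ScopedDevice("cuda:1"):
    wp.launch(fill, dim=1024, inputs=[a1, 2.0])

# copies can also move data between devices
wp.copy(a0, a1)
```
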
### Should I switch to Warp over IsaacGym / PhysX?
-------

Warp is not a replacement for IsaacGym, IsaacSim, or PhysX. While Warp does offer some physical simulation capabilities, these are primarily aimed at developers who need differentiable physics, rather than a fully featured physics engine. Warp is also integrated with IsaacGym and is great for performing auxiliary tasks such as reward and observation computations for reinforcement learning.