Automating Memory Configuration

isaki001 edited this page Sep 1, 2020 · 1 revision

How memory resources are utilized

  • Region Structure
  • In-kernel dynamic allocation for top-k error estimates
  • 128 regions and their indices in global memory, stored in per-block shared memory

Region Structure

Since in Phase II there is a 1-to-1 mapping between regions and thread blocks (as many regions as thread blocks), this is the most memory-hungry structure, with a default array size of 2048*32768 regions.

Region attributes:

  • boundaries[2*NDIM]
  • estimate
  • error estimate
  • number of sub-divisions
  • next dim to split

Error Estimates

In Phase II, each block starts with a single region and keeps sub-dividing until it reaches a configurable maximum (currently 2048 regions). Every time the 128-region shared-memory structure fills up, we must extract the 64 regions with the highest error estimates for sorting purposes (the shared-memory array acts as a cache for global memory). To perform this sort more efficiently, we copy the error estimates to a new location and then perform a parallel reduction 64 times. If the number of regions currently in the block requires more space than the available shared memory, the storage must be dynamically allocated. We therefore need enough global memory for each block to allocate the maximum amount, in order to guarantee that no allocation fails.

Additional Structures

It is worth noting that we also need to allocate structures that hold the following:

  • each region's estimate
  • error-estimate
  • local convergence status
  • local number of regions generated

By retaining the above information in separate structures, instead of encapsulating it in a class, we can perform very fast reduction operations that efficiently produce the corresponding cumulative values (total estimate, total error estimate, etc.).

Configuring for different hardware

We rely on querying the CUDA API's cudaMemGetInfo function, which returns the free and total amount of device memory, in bytes.

The target size of Phase I (numBlocks) was defined as a global variable with the value 16384, presumably based on experiments on pre-Volta architectures. Once the number of regions exceeds this maximum, we stop sub-dividing and proceed to Phase II if necessary. This means that numBlocks can be between 16384 and 32768. The array of regions in device global memory is allocated in the kernel's constructor based on this value.

The maximum number of allowed sub-divisions (max_globalpool_size) in each Phase II block was previously declared as a pre-processor macro with the value 2048. This value can affect performance, since these sub-divisions occur sequentially. This is the variable that enforces the per-block region limit: the more regions allowed, the better the chance of achieving local convergence, but execution time also increases as more regions need to be sorted.

We try to predict the total amount of device memory that will be used by the aforementioned structures. We then increase the Phase I target size to the largest power of 2 such that the sum of all structure sizes does not exceed the free memory reported by cudaMemGetInfo.

Alternatively

We can try to increase max_globalpool_size in multiples of 128 without exceeding the available device memory.

The values of numBlocks and max_globalpool_size will differ based on the amount of available memory, but also on the integral's dimensionality. On a V100 GPU, an 8D integral retains the default values, while for 5D the two approaches differ significantly. Configuring for the number of parallel blocks allows us to double the default value, but this also yields a decrease in accuracy. Configuring for the depth, however, allows us to double the depth in the 5D case.
