

Your kernel should also have occupancy high enough to hide local memory access latencies. Here is an example of a kernel where the compiler can't resolve indices to constants even if it unrolls the loop:

__global__ void kernel3(float *buf, int start_index) { ... }

A Kernel Profile experiment confirms that each access to array a now results in a local load or store, as Figure 2 shows.

Figure 2: Kernel Profile experiment in NVVP on an NVIDIA Tesla K20, showing local loads for dynamically addressed elements of a private array.

Note that this example demonstrates uniform access: all threads of each warp access elements of their own private array using the same index (even if this index is calculated dynamically at runtime). This enables the GPU load/store units to execute the instructions in the most efficient way.
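The body of kernel3 was lost in extraction. The following is a hedged reconstruction, not the original code: the array length, the loop bounds, and the exact arithmetic are my assumptions. What it preserves is the essential property described above: start_index is a runtime value, so the compiler cannot resolve the indices of a to constants, and the index is uniform across each warp.

```cuda
// Sketch of a kernel3-style kernel (details assumed, not from the post).
// start_index is only known at runtime, so even after unrolling the
// compiler cannot turn a[start_index + i] into a constant index:
// the private array a[] is placed in local memory.
__global__ void kernel3(float *buf, int start_index)
{
    float a[6];
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    #pragma unroll
    for (int i = 0; i < 6; ++i)
        a[i] = buf[index] * (float)i;   // static indices: fine

    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 2; ++i)
        sum += a[start_index + i];      // dynamic but warp-uniform index

    buf[index] = sum;
}
```

Because start_index comes from a kernel argument, every thread of a warp reads the same slot of its own private copy of a, which is the uniform-access pattern the load/store units handle most efficiently.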

For the unrolled loop above, the compiler is able to generate just 4 floating-point add instructions, without any loads or stores. Why 4 instructions when we have 5 additions? The compiler is smart enough to figure out that adding 0.0f to a[0] is just a[0], so it eliminates that instruction. See the screenshot of the Kernel Profile experiment in Figure 1.

Figure 1: Kernel Profile experiment in NVVP on an NVIDIA Tesla K20, showing the source-assembly correspondence for an unrolled loop.

In some cases the compiler can unroll the loop automatically without #pragma unroll. Note that the array size must be an immediate numeric constant; however, you can define it via a #define or a template parameter to the kernel.

Dynamic indexing

When the compiler can't resolve array indices to constants, it must put private arrays into GPU local memory. "Local" here doesn't necessarily mean this memory is close to the compute units; it means that it is local to each thread and not visible to other threads. Logical local memory actually resides in global GPU memory. Each thread has its own copy of any local array, and the compiler generates load and store instructions for array reads and writes, respectively. Using local memory is slower than keeping array elements directly in registers, but if you have sufficient math instructions in your kernel and enough threads to hide the latency, the local load/store instructions may be a minor cost. Empirically, a 4:1 to 8:1 ratio of math to memory operations should be enough; the exact number depends on your particular kernel and GPU architecture.
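As a sketch of the "array size must be a compile-time constant" point, here is one way to parameterize the size with a template argument; the kernel name and body are my own illustration, not code from the post:

```cuda
// Private-array size supplied as a template parameter.
// N is an immediate constant at instantiation time, so float a[N]
// is legal and the loops below can be fully unrolled.
template <int N>
__global__ void sum_private_array(float *buf)
{
    float a[N];
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    #pragma unroll
    for (int i = 0; i < N; ++i)
        a[i] = buf[index] + (float)i;

    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < N; ++i)
        sum += a[i];

    buf[index] = sum;
}

// Launch with a concrete size, e.g.:
//   sum_private_array<8><<<grid, block>>>(d_buf);
```

A #define works the same way; the only requirement is that the value is a numeric constant at compile time.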


Sometimes you need to use small per-thread arrays in your GPU kernels. The performance of accessing elements in these arrays can vary depending on a number of factors. In this post I'll cover several common scenarios, ranging from fast static indexing to more complex and challenging use cases.

Static indexing

Before discussing dynamic indexing, let's briefly look at static indexing. For small arrays where all indices are known constants at compile time, as in the following sample code, the compiler places all accessed elements of the array into registers. This way array elements are accessed in the fastest way possible: math instructions use the data directly without loads and stores.

A slightly more complex (and probably more useful) case is an unrolled loop over the indices of the array. In the following code the compiler is also capable of assigning the accessed array elements to registers. Here we tell the compiler to unroll the loop with the directive #pragma unroll, effectively replacing the loop with all the iterations listed explicitly, as in the following snippet:

sum += a[0];
sum += a[1];
sum += a[2];
sum += a[3];
sum += a[4];

All the indices are now constants, so the compiler puts the whole array into registers. I ran a Kernel Profile experiment in the NVIDIA Visual Profiler. Building the CUDA source files with the -lineinfo nvcc option embeds source line information in the binary, which lets the Visual Profiler show the correspondence between the CUDA C++ source code lines and the generated assembly instructions.
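The sample kernels referenced above did not survive extraction. Below is a hedged sketch of what they likely resembled; the kernel names and body details are my assumptions, but the indexing patterns match the text: one kernel with only constant indices, and one with a #pragma-unrolled loop that reduces to constant indices.

```cuda
// Sketch (assumed names/bodies): static indexing of private arrays.

// Every index into a[] is a literal constant, so the compiler can
// keep the whole array in registers.
__global__ void kernel1(float *buf)
{
    float a[2];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    a[0] = buf[index];
    a[1] = a[0] * 2.0f;
    buf[index] = a[0] + a[1];
}

// After #pragma unroll, each iteration has a constant index, so a[]
// again stays in registers; the five additions into sum reduce to
// four FADD instructions once the initial +0.0f is folded away.
__global__ void kernel2(float *buf)
{
    float a[5];
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    #pragma unroll
    for (int i = 0; i < 5; ++i)
        a[i] = buf[index] * (float)i;

    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 5; ++i)
        sum += a[i];

    buf[index] = sum;
}
```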
