Shared Memory Programming
Works best when the access latency is small - hence, with small memory spaces.
A common workload that operates on a small amount of memory and requires many processes is graphics processing.
CPU:
-------------------------
|         | ALU  | ALU  |
| Control |      |      |
|         |-------------|
|         | ALU  | ALU  |
|         |      |      |
-------------------------
|         Cache         |
|                       |
|                       |
|                       |
-------------------------
-------------------------
|         DRAM          |
-------------------------
NB: cache and DRAM more continuous now.
- The closer the memory is to the core, the lower the latency.
- ALUs are now very large and complex, as they must support a large number of operations
GPU:
-----------------------
|C/C| ALU | ALU | ALU |
-----------------------
|C/C| ALU | ALU | ALU |
-----------------------
|C/C| ALU | ALU | ALU |
-----------------------
|C/C| ALU | ALU | ALU |
-----------------------
|C/C| ALU | ALU | ALU |
-----------------------
-----------------------
| DRAM |
-----------------------
Many simple ALUs that share tiny control and cache structures.
More control over memory: you can choose whether a variable is local, shared or global (three levels of on-device memory), plus off-device memory (host/CPU memory).
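As a rough sketch of how these levels map onto Numba CUDA constructs (the kernel and variable names below are illustrative, not from the lecture; the APIs are standard Numba ones):
from numba import cuda, float32
import numpy

@cuda.jit
def memory_levels_demo(global_arr):
    scratch = cuda.local.array(4, float32)   # local memory: private to this thread
    tile = cuda.shared.array(32, float32)    # shared memory: visible to all threads in this block
    i = cuda.grid(1)
    if i < global_arr.size:
        scratch[0] = global_arr[i]           # read from global (device) memory
        tile[cuda.threadIdx.x] = scratch[0]
        cuda.syncthreads()
        global_arr[i] = tile[cuda.threadIdx.x]

host_data = numpy.ones(32, dtype=numpy.float32)   # host (CPU) memory
device_data = cuda.to_device(host_data)           # copy to global (device) memory
memory_levels_demo[1, 32](device_data)
result = device_data.copy_to_host()               # copy back to host memory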
CUDA Model
Grid-based decomposition: the grid covers the entire computation, which is split into blocks.
Blocks are independent subtasks that can be run in any order. Threads within a block can share memory with each other; blocks cannot share memory between themselves.
Threads in each block run the same instructions simultaneously but access different parts of memory. Hence, only one set of control logic is required per block.
Thread blocks must be able to run independently: they may execute in any order, in parallel, etc.
For thread safety, two threads must not read and write the same memory address at the same time.
Numba
A just-in-time (JIT) compilation library that compiles Python functions into machine code.
from numba import jit

# NB: `%timeit` reports the time taken to run a specific line of code in Jupyter

@jit
def bubblesort_jit(arr):
    N = len(arr)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = arr[i]
            if cur > arr[i + 1]:
                # Swap the out-of-order pair
                arr[i] = arr[i + 1]
                arr[i + 1] = cur
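A minimal usage sketch (the array size is illustrative; `.py_func`, the dispatcher's reference to the original uncompiled Python function, is a standard Numba attribute used here just for comparison):
import numpy

data = numpy.random.rand(2000)       # illustrative array size
bubblesort_jit(data.copy())          # first call triggers JIT compilation
# In Jupyter:
#   %timeit bubblesort_jit(data.copy())          # compiled version
#   %timeit bubblesort_jit.py_func(data.copy())  # original, uncompiled Python function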
On the GPU:
from numba import cuda
import numpy

@cuda.jit
def my_kernel(io_array):
    # io_array lives in GPU global (device) memory; Numba copies it to the GPU for us
    pos = cuda.grid(1)  # absolute ID of this thread; like `rank` in MPI
    # Equivalent to cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    # cuda.gridsize(1) gives the total number of threads in the grid
    if pos < io_array.size:
        io_array[pos] = pos

data = numpy.ones(257)
threads_per_block = 256  # 256 threads in each block. Max ~1024 on most Nvidia GPUs, or fewer for memory-intensive kernels
blocks_per_grid = 1      # and just one block

# data is copied from CPU memory to GPU global memory, and copied back after the kernel finishes
my_kernel[blocks_per_grid, threads_per_block](data)
print(data)
# Prints `0, 1, 2, ..., 254, 255, 1` (the 257th element is untouched since there are only 256 threads)
CUDA kernels have no return values; instead, results are written into an array passed in as an argument (e.g. cuda.to_device(numpy.zeros(shape=1, dtype=numpy.int32))).
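A minimal sketch of this pass-in-an-output-array pattern (the kernel name and sizes are illustrative):
from numba import cuda
import numpy

@cuda.jit
def square_kernel(values, out):
    i = cuda.grid(1)
    if i < values.size:
        out[i] = values[i] * values[i]   # write results into the output array

values = cuda.to_device(numpy.arange(8, dtype=numpy.float32))
out = cuda.device_array(8, dtype=numpy.float32)   # allocated directly on the GPU
square_kernel[1, 32](values, out)
print(out.copy_to_host())   # [ 0.  1.  4.  9. 16. 25. 36. 49.]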
Atomic Operations
Incrementing a counter is not an atomic operation: it loads, adds, then stores. If another thread writes the value between the load and the store, one of the updates is lost. Hence, atomic operations are needed to do the whole update in a single step.
cuda.atomic.add(arr, index, increment_by)
cuda.atomic.compare_and_swap(arr, old, new) # if arr[0] == old, set it to new
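For example, a small sketch of a shared counter updated with cuda.atomic.add (the kernel name and data are illustrative):
from numba import cuda
import numpy

@cuda.jit
def count_positive(values, counter):
    i = cuda.grid(1)
    if i < values.size and values[i] > 0:
        # Load, add and store happen as one indivisible step, so no update is lost
        cuda.atomic.add(counter, 0, 1)

values = cuda.to_device(numpy.array([1, -2, 3, 4, -5], dtype=numpy.int32))
counter = cuda.to_device(numpy.zeros(1, dtype=numpy.int32))
count_positive[1, 32](values, counter)
print(counter.copy_to_host()[0])   # 3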
Matrix multiplication: take the dot product (element-wise multiply and sum) of a row of matrix A with a column of matrix B to get one element of the result.
import math
import numpy
from numba import cuda

@cuda.jit
def matrix_mul(A, B, C):
    row, col = cuda.grid(2)  # interpret the thread ID as a 2D position
    if row < C.shape[0] and col < C.shape[1]:
        # i.e. row is a valid row index of C and col is a valid column index of C
        tmp = 0
        for k in range(A.shape[1]):
            tmp += A[row, k] * B[k, col]
        C[row, col] = tmp

A = numpy.full((24, 12), 3, numpy.float64)  # 24x12 matrix of 3s
B = numpy.full((12, 22), 4, numpy.float64)
A_global_mem = cuda.to_device(A)
B_global_mem = cuda.to_device(B)
C_global_mem = cuda.device_array((24, 22))  # allocate memory for the 24x22 result array

threads_per_block = (16, 16)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(B.shape[1] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)

matrix_mul[blocks_per_grid, threads_per_block](A_global_mem, B_global_mem, C_global_mem)
C = C_global_mem.copy_to_host()
The memory access pattern here is poor for one of A and B: memory is fastest when accessed contiguously, so if the arrays are stored row-major (as NumPy arrays are by default), reading a row of A touches a single contiguous area of memory, but reading a column of B requires a separate read for each element.
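A quick, illustrative way to see this layout (not from the lecture) is to look at the array strides:
import numpy

B = numpy.full((12, 22), 4, numpy.float64)
print(B.strides)   # (176, 8): stepping along a row moves 8 bytes,
                   # but stepping down a column moves 176 bytes, so a
                   # column read touches widely separated memory locations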
Numpy arrays are created in CPU memory and then sent to the GPU. However, arrays can also be created directly on the GPU:
# Where `shape` is a tuple of dimension sizes e.g. (4, 6) for a 4x6 2D array
# and `dtype` is a type such as `float32`
global_arr = cuda.device_array(shape)         # global (device) memory, allocated from host code
shared_arr = cuda.shared.array(shape, dtype)  # shared memory, only usable inside a kernel
Locks
An alternative to atomic operations (one that is also available on CPUs): in a critical section of code, you tell the hardware not to let anyone else read or write the value (mutual exclusion). This blocks other processes from running and so is slow.
Spin lock:
volatile char lock = 0;

while (true) {
    // Polling - inefficient
    if (lock == 0) {
        // Problem: another thread can set the lock between this read and the write below
        lock = 1;
        break;
    }
}
do_stuff();
lock = 0;
Atomic lock: make sure no one gets between load and store using a common primitive (compare-and-swap). Write only if the value in memory matches the expected value.
In pseudo-code, the compare-and-swap acts like:
temp = *addr;
if (temp == old) {
    *addr = new;
} else {
    old = temp;
}
This is implemented as a single atomic instruction. In x86, it is called CMPXCHG (compare and exchange).
Memory
Memory wall: processor speed is increasing faster than memory bandwidth.
In 2007:
- Registers: ~500 bytes, 250 ps
- Cache reference: ~64 KB, 1 ns
- Memory reference: ~1 GB, 100 ns
Bandwidth is improving faster than latency (roughly 10x over the last 25 years), so this disparity is likely to remain for a while.
NVIDIA Memory
Each GPU has several stream multiprocessors, each with its own shared memory and several scalar processors:
- Where code actually runs
- ~8 KB of register space per processor
Use of device memory (the GPU's off-chip global DRAM, not CPU RAM) should be avoided where possible. In many cases it is possible to schedule the transfer of at least part of the dataset into shared memory so that code is never left waiting for this transfer.
Decompose the problem into smaller, independent problems such that the data fits in shared memory. For matrix multiplication:
from numba import cuda, float32

# Threads per block. The problem is split into tiles of TPBxTPB elements
TPB = 16

@cuda.jit
def fast_matrix_mul(A, B, C):
    # Shared-memory arrays of TPBxTPB elements (the shape must be a compile-time constant)
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    # Shared memory is faster than global memory, but data must be moved into it explicitly

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    if x >= C.shape[0] and y >= C.shape[1]:
        # Outside of the result matrix
        return

    # Each thread computes one element of the result matrix
    tmp = 0
    for i in range(int(A.shape[1] / TPB)):
        # Preload into shared memory: each thread reads only two values from global memory per tile
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]
        cuda.syncthreads()  # a barrier: wait until all threads have finished preloading their values
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]
        cuda.syncthreads()  # wait until all threads have finished computing with this tile
    C[x, y] = tmp

# This can be more than 16x faster: there are 16x fewer global memory reads, and memory access
# for threads within a block is contiguous, so the reads can be coalesced into one operation.
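A host-side launch sketch for this kernel (the matrix sizes below are illustrative, chosen as multiples of TPB since the loop above assumes A.shape[1] divides evenly by TPB):
import math
import numpy

A = numpy.full((64, 32), 3, numpy.float32)   # dimensions chosen as multiples of TPB
B = numpy.full((32, 48), 4, numpy.float32)
A_global_mem = cuda.to_device(A)
B_global_mem = cuda.to_device(B)
C_global_mem = cuda.device_array((64, 48), dtype=numpy.float32)

threads_per_block = (TPB, TPB)
blocks_per_grid = (int(math.ceil(A.shape[0] / TPB)),
                   int(math.ceil(B.shape[1] / TPB)))

fast_matrix_mul[blocks_per_grid, threads_per_block](A_global_mem, B_global_mem, C_global_mem)
C = C_global_mem.copy_to_host()
print(C[0, 0])   # each element is 3 * 4 * 32 = 384.0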
Thread Scheduling/Execution
Each thread block is divided into 32-thread warps; the warp is the basic scheduling unit in an SM. If the number of threads is not divisible by 32, the excess threads in the last warp are automatically disabled.
e.g. if 3 blocks are processed by an SM and each has 256 threads, there will be 3 blocks * (256 threads/block) / (32 threads/warp) = 24 warps.
At any point in time, only one of the warps will be selected for instruction fetch and execution.
The scheduler looks at memory access patterns and staggers/interleaves the warps' operations to reduce the time spent waiting for memory access. Each warp can be in one of three states:
- Executing
- Waiting for data
- Ready to execute
When a warp is selected for execution, all active threads execute the same instruction in lock-step.
Filling warps:
- Prefer thread block sizes that result in (mostly) full warps
- More threads per block (128 or so) are preferred so that the scheduler has warps to switch between
- Resources like __shared__ (shared memory) can constrain the number of threads per block
Occupancy: the number of warps running concurrently on an SM divided by the maximum number of warps that can run concurrently (the warps can belong to different blocks). Each Fermi SM can run up to 48 warps, but hardware constraints can reduce this (a rough worked example follows this list):
- Number of registers per kernel
  - 32 KB per multiprocessor, partitioned among the concurrent threads active on the SM
- Shared memory
  - 16 or 48 KB per multiprocessor, partitioned among the SM's concurrent blocks
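For instance, a rough occupancy estimate under the shared-memory constraint alone (the 12 KB per block figure is an assumption; the 48 KB and 48-warp limits are the Fermi figures above):
# Illustrative occupancy estimate, limited only by shared memory
max_warps_per_sm = 48
shared_mem_per_sm = 48 * 1024        # bytes
shared_mem_per_block = 12 * 1024     # assumed shared memory usage of each block
threads_per_block = 256

blocks_per_sm = shared_mem_per_sm // shared_mem_per_block   # 4 blocks fit on the SM
warps_per_sm = blocks_per_sm * threads_per_block // 32      # 4 * 8 = 32 warps
occupancy = warps_per_sm / max_warps_per_sm
print(occupancy)   # ~0.67: shared memory limits occupancy to about two thirds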
The SM can switch between warps with no apparent overhead. This is possible because every resident thread keeps its own registers in the SM's register file: registers are allocated per thread (so a warp holds 32 sets), and the register file is partitioned among all the threads resident on the SM. Hence, even with 8 KB of register space per block, 128 threads leaves only 64 bytes each.
If there are not enough registers, some values spill into global memory.