Scheduling

参考
2-CUDA-Introduction-2-of-2
Cuda 编程之 Tiling

Physical Architecture

NVIDIA GeForce 8800 GTX G80-300-A2

df3024cc4f9c45d0138a0b8f4d569ae1.png

d1d005177d7b9a00c9d316afc9e4c0e7.png

8ba3a2b1ab45dd6bb6914b2946debaa7.png

最小的单位, SP(Streaming Processor),然后是SPs(Streaming Processors)

以 Tesla 架构的 GeForce 8 系列搭载 G80 的 8800 GTX 为例,其拥有 16 个 Streaming Multiprocessor

Config Name Num
Shading Units 128
TMUs 32
ROPs 24
SM Count 16
L2 Cache 96KB

相比较之下,Ada LoveLace 架构的 4070 Super 其拥有56个Streaming Multiprocessor

需要注意的是,每个SM所能含有的最大threads 是随着架构的提升而增大的,不过至今的数代都保持在了2048这个数字。

Warp

Users don’t control warp, it is hardware group of 32 consecutive “lanes”(32 threads of a block)

This is why we keep threads-per-block size as multiple of 32

  • Each thread occupies 1 lane, warp lanes are consecutive threadIdx values

  • Unit of sheduling and execution

  • CUDA GPUs schedule warps for execution, not threads

  • All threads in a warp execute same instruction at the same clock-time

  • Zero-overhead context switching - All resources reside on the same SM until execution

  • A block will always on the same SM

  • All threads in a warp execute the same insturction

    • While all threads in a grid run the same kernel code, only threads within the same warp are guaranteed to execute the same instruction at the same time.
    • They are all working from the same set of instructions, but they are not synchronized. They run independently and can be at completely different points in the program at any time.
  • Swapping sheduled warps has zero overheads

  • Scheduler always tries to optimize execution

  • More resources required - less warp

    • Registers: A large but limited set of super-fast memory for each thread’s private variables.

    • Shared Memory: A fixed amount of fast on-chip memory allocated to thread blocks

  • One SM can handle multiple block at the same time

    • Warp Size is the unit of execution.
    • Concurrent Threads is the unit of residency. This is the total number of threads that are loaded, active, and have their resources allocated on the SM, ready to be executed.
Scenario Registers per Thread Max Concurrent Threads on SM Max Concurrent Warps on SM
High Usage 64 32,768 / 64 = 512 512 / 32 = 16
Low Usage 16 32,768 / 16 = 2048 2048 / 32 = 64

Example Ⅰ

32 threads per wap but 8 SPs per SM, What gives?

When an SM shedule its

A kernal has

  • 1 global memory read
  • 4 non-dependent multiples/adds

How many warps are required to hide the memory latency ?

Each warp has 4 multiples/adds
16 cycles

We need to cover 200 cycles

  • 200/16 = 12.5
  • ceil(12.5) = 13

13 warps are required

Example Ⅱ

  • What actually happens when you launch a kernel say with 100 blocks each with 64 threads?
    • Let’s say on a GT80 (i.e. 16 SMs, 8 SPs each)
    • Max T/block is 512, Max T/SM is 768 threads
    • Chip level scheduling:
      • 100 blocks of 64 threads
      • 768/64 => max 12 blocks can be scheduled for every SM
    • SM level scheduling:
      • 12 blocks = 768 threads = 24 warps
      • Warp scheduler kicks in:
        • 8 threads -> 8 threads -> 8 threads -> 8 threads(因为这个是8 SPs each)

上述描绘了无需考虑thread所使用的resources的简化情况, 注意,每个SM这里的确可以支持12个blocks,但是100个并不是说就使用9个SM,Since your 100 blocks are fewer than the GPU’s capacity of 192, the scheduler will load all 100 blocks onto the 16 SMs immediately. Some SMs will get 6 blocks, and some will get 7, until all 100 are distributed.

Divergence

What happens if branches in a warp diverge ?

ec87ad03a1ab1e1349f452dd47b3e862.png

If threads within a warp take different paths in an if-else statement, the hardware has to serialize the paths, executing one after the other, which can reduce performance.

Synchronization

  • Call __syncthreads();, it is on a block level

Why it is important that execution time be similar among threads ?
Long-running threads lead to inefficient stalls

Why does it only synchronize within a block?
Execution across blocks is not concurrent or ordered

Golden Role

For a __syncthreads() to work, every single thread in a block must be able to reach the same __syncthreads() call. If even one thread takes a path that skips the synchronization point, the other threads in the block will wait for it forever.

Deadlock

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
__global__ void faulty_kernel(float* data) {
int idx = threadIdx.x;

// Even and odd threads take different paths
if (idx % 2 == 0) {
// Even threads do some work...
data[idx] = 1.0f;

// ...and then wait here forever for the odd threads.
__syncthreads();
} else {
// Odd threads are paused while the 'if' block executes.
// They will never reach the __syncthreads() call above.
data[idx] = 2.0f;
}
}
  1. A warp encounters an if-else statement. Some threads evaluate the condition as true, and others evaluate it as false.
  2. The hardware picks one path to execute first (e.g., the if block). The threads that chose this path become active, while the threads that need to take the else path are temporarily paused (“masked off”).
  3. The active threads enter the if block and hit the __syncthreads(). They now stop and wait for all other threads in their block to also arrive at this exact synchronization point.
  4. Deadlock: The paused threads can never reach the __syncthreads() inside the if block because they are waiting to execute the else block. The active threads will never proceed because they are waiting for the paused threads. The entire block is now stuck in a permanent standoff.

Each __syncthreads is unique

Revisit of Matrix Multiply

The previous code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
__global__ void MatrixMultiplyKernel(const float* devM, const float* devN,
float* devP, const int width)
{
int tx = threadIdx.x;
int ty = threadIdx.y;

// Initialize accumulator to 0
float pValue = 0;

// Multiply and add
for (int k = 0; k < width; k++) {
float m = devM[ty * width + k];
float n = devN[k * width + tx];
pValue += m * n;
}

// Write value to device memory - each thread has unique index to write to
devP[ty * width + tx] = pValue;
}

原来的想法很朴素,我最终的矩阵多大,我就开多少个线程去算,一个线程对应最终矩阵C的一个值。
d3c19410256d0c0e6f36c0828fbc1d8b.png
那么这个线程去算的时候呢,要看一下矩阵A的一行,再看矩阵B的一列。最终进行乘法、相加运算。C中的每个元素的线程都要去看A和看B,那么就会带来大量的对于Global Memory的访问,这会十分低效。

直觉上打一个不恰当的比方,我们给一个大房间铺地砖的时候,应该是拿个小推车去从大货车上装个一批,然后在房间中从这个小推车上拿了地砖去铺,小推车拿空了小推车再去装一批。而不是每铺一个地砖都要去大货车中拿一块地砖。

我们要做的是,找的一种办法提升shared memory的使用,减少Global Memory的访问,这种办法可以记录某些中间状态,让矩阵C的每个元素计算不至于总是去查A和、B
9e337856e7560f777e39050b03b69bfd.png
cb97f272024107617519bdf095999bf9.png
图片出自Cuda 编程之 Tiling

Say, we have matrix $A$

$$
\left[ \begin{array}{cc}
2 & 6 & 7 &5 \
3 & 1 & 4 &6 \
8 & 9 & 0 &1 \
2 & 7 & 7 &4
\end{array}
\right]
$$

matrix $B$

$$
\left[ \begin{array}{cc}
1 & 6 & 2 &1 \
3 & 9 & 8 &4 \
5 & 6 & 3 &9 \
1 & 0 & 7 &2
\end{array}
\right]
$$

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
__global__ void MatrixMultiplyKernel(const float* devM, const float* devN,
float* devP, const int width)
{
__shared__ float sM[TILE_WIDTH][TILE_WIDTH];
__shared__ float sN[TILE_WIDTH][TILE_WIDTH];

int bx = blockIdx.x; int by = blockIdx.y;
int tx = threadIdx.x; int ty = threadIdx.y;

int col = bx * width + tx;
int row = by * width + ty;

// Initialize accumulator to 0
float pValue = 0;

// Multiply and add
for (int m = 0; m < width / TILE_WIDTH; m++) {
sM[ty][tx] = devM[row * width + (m * TILE_WIDTH + tx)];
sN[ty][tx] = devN[col + (m * TILE_WIDTH + ty) * Width];
__syncthreads();

for (int k = 0; k < TILE_WIDTH; ++k)
pValue += sM[ty][k] * sN[k][tx];
__syncthreads();
}
devP[row * width + col] = pValue;
}

for the first iteration, we have

Phase 1: Processing the First Pair of Tiles

Load Tiles into Shared Memory

matrix $sM$

$$
\left[ \begin{array}{cc}
2 & 6 \
3 & 1\
\end{array}
\right]
$$

matrix $sN$

$$
\left[ \begin{array}{cc}
1 & 6 \
3 & 9\
\end{array}
\right]
$$

__syncthreads();

Compute from Shared Memory

Our target thread (0,0) now calculates its part of the dot product using the values in the fast shared memory.

1
2
3
pValue += sM[0][0] * sN[0][0] + sM[0][1] * sN[1][0]
pValue += (2 * 1) + (6 * 3)
pValue = 20

Our target thread (0,1):

1
2
3
pValue += sM[0][0] * sN[0][1] + sM[0][1] * sN[1][1]
pValue += (2 * 6) + (6 * 9)
pValue = 66

Our target thread (1,0):

1
2
3
pValue += sM[1][0] * sN[0][0] + sM[1][1] * sN[1][0]
pValue += (3 * 1) + (1 * 3)
pValue = 6

Our target thread(1,1):

1
2
3
pValue += sM[1][0] * sN[0][1] + sM[1][1] * sN[1][1]
pValue += (3 * 6) + (1 * 9)
pValue = 27

__syncthreads();

all four of those calculations are done at the same time, in parallel.

Phase 2: Processing the Second Pair of Tiles

Load Tiles into Shared Memory

matrix $sM$

$$
\left[ \begin{array}{cc}
7 & 5 \
4 & 6\
\end{array}
\right]
$$

matrix $sN$

$$
\left[ \begin{array}{cc}
5 & 6 \
1 & 0\
\end{array}
\right]
$$

__syncthreads();

Compute from Shared Memory

Our target thread (0,0) now calculates its part of the dot product using the values in the fast shared memory.

1
2
3
4
5
pValue += sM[0][0] * sN[0][0] + sM[0][1] * sN[1][0]
pValue += (7 * 5) + (5 * 1)
pValue += 40

pValue = 60

Our target thread (0,1):

1
2
3
4
5
pValue += sM[0][0] * sN[0][1] + sM[0][1] * sN[1][1]
pValue += (7 * 6) + (5 * 0)
pValue += 42

pValue = 108

Our target thread (1,0):

1
2
3
4
5
pValue += sM[1][0] * sN[0][0] + sM[1][1] * sN[1][0]
pValue += (4 * 5) + (6 * 1)
pValue += 26

pValue = 32

Our target thread(1,1):

1
2
3
4
5
pValue += sM[1][0] * sN[0][1] + sM[1][1] * sN[1][1]
pValue += (4 * 6) + (6 * 0)
pValue += 24

pValue = 51

__syncthreads();

all four of those calculations are done at the same time, in parallel.

Now, we have the left top block of C:

$$
\left[ \begin{array}{cc}
60 & 108 \
32 & 51\
\end{array}
\right]
$$

为什么我们可以这么做?

Dot product can be done as partial nums

在naive中,我们想求一个C的元素,我们一次性去拿A的对应横行和B的对应纵行进行计算。在tiling中,我们把这个计算分为多步,对于每个我们关注的C的子矩阵,即一个block,比如例子这里就是left top block of C,我们去看原矩阵A、B对应的block,而后进行运算。这会让我们不再每求一个C的元素都要将A的对应横行和B的对应纵行全部访问一遍,而是,我关注的这个C的block在原A矩阵中的对应(横向)block 以及原B矩阵中的对应(竖向)的block块A,B,C, 做累加操作。每次我都会先缓存以下当前关注的A、B中的block,称作sM、sN
以下是block同步开始的
top left block of C
drawio
drawio

top second left block of C
drawio
drawio

Cpp & Basic Concepts

CPU side serial code controls(send commands to) CPU side parallel code

  • 初始化数据(在CPU内存中)
  • 分配GPU内存
  • 将数据从CPU内存拷贝到GPU内存
  • 启动GPU上的核函数(Kernal)
  • 等待GPU计算完成
  • 将计算结果从GPU内存拷贝到CPU内存
  • 释放GPU和CPU内存

CUDA 函数执行空间限定符

限定符 执行位置 调用位置
__global__ 设备(GPU) 主机(CPU)
__device__ 设备 设备
__host__ 主机 主机

kernals are running on the GPU, so we use pointers to access memory

__global__

1
__global__ void myKernel()
  • must return void
  • 如果需要返回结果,必须通过传入指针,让核函数将结果写入GPU内存中
  • 使用一种特殊的 <<<...>>> 执行配置语法来调用,例如 myKernal<<<grid, block>>>(args...);

__device__

1
__device__ float myDeviceFunction()
  • 这是一个只能在GPU上执行,并且也只能被其他 __global____device__ 函数调用的函数。它通常用于在核函数中实现一些可重用的辅助功能,类似于普通C++代码中的普通函数。
  • Inlined by default

for all device code

  • No static variables
    • lifetime of the program
    • on the gpu there is no lifetime concept
  • No malloc()
    • never
    • many of threads tring to allocate, not enough cache to do that
    • compiler allows, but performance issue
    • GPU内存最好由主机端统一管理。标准的做法是在主机端使用 cudaMalloc() 分配一大块内存,然后将指向这块内存的指针传递给核函数。GPU线程在这个预先分配好的内存区域中进行读写操作。

__host__

1
__host__ int myHostFunction()

这就是一个普通的C/C++函数,在CPU上执行,也只能被CPU调用。如果不写任何限定符,函数默认就是 __host__

组合用法

__device__ __host__ void func() VALID
这意味着这个函数被编译了两次:一个版本用于在CPU上调用,另一个版本用于在GPU上调用。这在我们希望CPU和GPU共享某个工具函数(例如一个简单的数学计算)时非常有用。

__global__ __host__ void func() INVALID
这个组合在逻辑上是矛盾的。__global__ 的核心定义是“从CPU调用,在GPU执行”,而 __host__ 的定义是“从CPU调用,在CPU执行”。一个函数不能同时满足这两种执行模式,因此编译器禁止这种组合。

if __global__, it is only __global__ nothing else

pow, sqrt, exp, sin, cos
__powf, __sinf, __logf, __exp

Grid, Block, Thread

这些都是抽象的概念,并非是真是的物理结构,区分这些概念是为了更高效地编程
b789eb919c1cff5db1c02942466e4fc9.png

让不同的线程,拿不同的数据,进行相同的运算

gridDimblockDim 都是dim3

1688f01b2eea083b6cd69c7e1d158aed.png

上图中的, 表示1D Grid, 其在x方向上有4个Block,而其他方向上都只有1个Block (默认1个)

1
dim3 gridDim(4);

一般的,我们需要设置 block 中的每个维度的线程数为 32 的整数倍

Thread

执行计算的最基本单元。每个线程都会完整地执行一遍核函数 (__global__函数) 的代码。

Treading Model

Real world limitations
- No. of cores Cores/ Transistors for memory
- Power
- Scheduling

Thread Hierarchies

__global__ 函数内部,可以直接使用一些内置的只读变量来确定当前线程的位置:
dim3 gridDim 网格的维度
dim3 blockDim 线程块的维度
uint3 blockIdx 当前线程所在Block在Grid中的索引
uint3 threadIdx 当前线程在Block中的索引

1D: Thread ID == Thread Index
2D with size (Dx, Dy)
Thread ID of index (x, y) == x + y Dx
3D with size (Dx, Dy, Dz)
Thread ID of index (x, y, z) == x + y Dx + z Dx Dy

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// Kernel function to add two matrices
__global__ void matrixAddition(const float* A, const float* B, float* C, int width, int height)
{
// Calculate the global row and column index
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;

// Check if the thread is within the matrix bounds
if (col < width && row < height) {
int idx = row * width + col;
C[idx] = A[idx] + B[idx];
}
}

int main()
{
int width = 1024;
int height = 1024;
int numElements = width * height;
size_t size = numElements * sizeof(float);

// ... (此处省略了为 A, B, C 分配主机和设备内存,以及数据拷贝的代码) ...
// ... cudaMalloc, cudaMemcpy, etc. ...

// Define the dimensions of the thread block
// 通常选择一个二维的块,例如 16x16 或 32x32
dim3 threadsPerBlock(16, 16);

// Calculate the dimensions of the grid
// 向上取整,确保有足够的block覆盖整个矩阵
dim3 gridDim( (width + threadsPerBlock.x - 1) / threadsPerBlock.x,
(height + threadsPerBlock.y - 1) / threadsPerBlock.y );

std::cout << "Launching Kernel with Grid: (" << gridDim.x << ", " << gridDim.y << "), Block: (" << threadsPerBlock.x << ", " << threadsPerBlock.y << ")" << std::endl;

// Launch the kernel
matrixAddition<<<gridDim, threadsPerBlock>>>(A_d, B_d, C_d, width, height);

// ... (此处省略了将结果从C_d拷贝回C_h,以及释放内存的代码) ...
// ... cudaMemcpy, cudaFree, free ...

return 0;
}

Block

块内的线程可以相互协作,例如通过共享内存 (Shared Memory) 快速交换数据,也可以进行同步 (__syncthreads())。

一个块内的所有线程必须在同一个流式多处理器 (Streaming Multiprocessor, SM) 上执行

一个 Block 至多可以容纳 1024 个 Thread,这是自开普勒架构(GTX 10系列)显卡之后的规范。

但是,1024个线程并不意味着同一时间会执行1024个,最小的执行单位是Warp

1
dim3 threadsPerBlock(16, 16);

This line declares that each thread block will be a 2D grid containing 16 x 16 = 256 threads. You are creating a small, square team of threads. Here, threadsPerBlock.x is 16 and threadsPerBlock.y is 16. The z-dimension is 1 by default.

d812751cd8bd70d7974fb4061079c984.png

需要注意的是,这里的图片是Scalability的实例,并不是说每次SM就处理一个Block。在同一个Block上运行的threads 确实都在同一个SM上,也就意味着其都可以与L1通信,但是,通过这种办法来达成Block间的通信是不安全的。

不同块之间的线程是无法直接通信和同步的。

Grid

一组线程块的集合。
一个核函数的所有线程都组织在一个网格中