Posted 2025-09-03Updated 2025-09-07Penn / 25 Fall / CIS 5650 GPU Programming / Lectures14 minutes read (About 2163 words)

参考
2-CUDA-Introduction-2-of-2
Cuda 编程之 Tiling

Physical Architecture

NVIDIA GeForce 8800 GTX G80-300-A2

最小的单位， SP(Streaming Processor)，然后是SPs(Streaming Processors）

以 Tesla 架构的 GeForce 8 系列搭载 G80 的 8800 GTX 为例，其拥有 16 个 Streaming Multiprocessor

Config Name	Num
Shading Units	128
TMUs	32
ROPs	24
SM Count	16
L2 Cache	96KB

相比较之下，Ada LoveLace 架构的 4070 Super 其拥有56个Streaming Multiprocessor

需要注意的是，每个SM所能含有的最大threads 是随着架构的提升而增大的，不过至今的数代都保持在了2048这个数字。

Warp

Users don’t control warp, it is hardware group of 32 consecutive “lanes”(32 threads of a block)

This is why we keep threads-per-block size as multiple of 32

Each thread occupies 1 lane, warp lanes are consecutive threadIdx values
Unit of sheduling and execution
CUDA GPUs schedule warps for execution, not threads
All threads in a warp execute same instruction at the same clock-time
Zero-overhead context switching - All resources reside on the same SM until execution
A block will always on the same SM
All threads in a warp execute the same insturction
- While all threads in a grid run the same kernel code, only threads within the same warp are guaranteed to execute the same instruction at the same time.
- They are all working from the same set of instructions, but they are not synchronized. They run independently and can be at completely different points in the program at any time.
Swapping sheduled warps has zero overheads
Scheduler always tries to optimize execution
More resources required - less warp
- Registers: A large but limited set of super-fast memory for each thread’s private variables.
- Shared Memory: A fixed amount of fast on-chip memory allocated to thread blocks
One SM can handle multiple block at the same time
- Warp Size is the unit of execution.
- Concurrent Threads is the unit of residency. This is the total number of threads that are loaded, active, and have their resources allocated on the SM, ready to be executed.

Scenario	Registers per Thread	Max Concurrent Threads on SM	Max Concurrent Warps on SM
High Usage	64	32,768 / 64 = 512	512 / 32 = 16
Low Usage	16	32,768 / 16 = 2048	2048 / 32 = 64

Example Ⅰ

32 threads per wap but 8 SPs per SM, What gives?

When an SM shedule its

A kernal has

1 global memory read
4 non-dependent multiples/adds

How many warps are required to hide the memory latency ?

Each warp has 4 multiples/adds
16 cycles

We need to cover 200 cycles

200/16 = 12.5
ceil(12.5) = 13

13 warps are required

Example Ⅱ

What actually happens when you launch a kernel say with 100 blocks each with 64 threads?
- Let’s say on a GT80 (i.e. 16 SMs, 8 SPs each)
- Max T/block is 512, Max T/SM is 768 threads
- Chip level scheduling:
  - 100 blocks of 64 threads
  - 768/64 => max 12 blocks can be scheduled for every SM
- SM level scheduling:
  - 12 blocks = 768 threads = 24 warps
  - Warp scheduler kicks in:
    - 8 threads -> 8 threads -> 8 threads -> 8 threads（因为这个是8 SPs each）

上述描绘了无需考虑thread所使用的resources的简化情况，注意，每个SM这里的确可以支持12个blocks，但是100个并不是说就使用9个SM，Since your 100 blocks are fewer than the GPU’s capacity of 192, the scheduler will load all 100 blocks onto the 16 SMs immediately. Some SMs will get 6 blocks, and some will get 7, until all 100 are distributed.

Divergence

What happens if branches in a warp diverge ?

If threads within a warp take different paths in an if-else statement, the hardware has to serialize the paths, executing one after the other, which can reduce performance.

Synchronization

Call __syncthreads();, it is on a block level

Why it is important that execution time be similar among threads ?
Long-running threads lead to inefficient stalls

Why does it only synchronize within a block?
Execution across blocks is not concurrent or ordered

Golden Role

For a __syncthreads() to work, every single thread in a block must be able to reach the same __syncthreads() call. If even one thread takes a path that skips the synchronization point, the other threads in the block will wait for it forever.

Deadlock

__global__ void faulty_kernel(float* data) {
    int idx = threadIdx.x;

    // Even and odd threads take different paths
    if (idx % 2 == 0) {
        // Even threads do some work...
        data[idx] = 1.0f;
        
        // ...and then wait here forever for the odd threads.
        __syncthreads();
    } else {
        // Odd threads are paused while the 'if' block executes.
        // They will never reach the __syncthreads() call above.
        data[idx] = 2.0f;
    }
}

A warp encounters an if-else statement. Some threads evaluate the condition as true, and others evaluate it as false.
The hardware picks one path to execute first (e.g., the if block). The threads that chose this path become active, while the threads that need to take the else path are temporarily paused (“masked off”).
The active threads enter the if block and hit the __syncthreads(). They now stop and wait for all other threads in their block to also arrive at this exact synchronization point.
Deadlock: The paused threads can never reach the __syncthreads() inside the if block because they are waiting to execute the else block. The active threads will never proceed because they are waiting for the paused threads. The entire block is now stuck in a permanent standoff.

Each __syncthreads is unique

Revisit of Matrix Multiply

The previous code:

__global__ void MatrixMultiplyKernel(const float* devM, const float* devN,
                                     float* devP, const int width)
{
int tx = threadIdx.x;
int ty = threadIdx.y;

// Initialize accumulator to 0
float pValue = 0;

// Multiply and add
for (int k = 0; k < width; k++) {
float m = devM[ty * width + k];
float n = devN[k * width + tx];
pValue += m * n;
}

// Write value to device memory - each thread has unique index to write to
devP[ty * width + tx] = pValue;
}

原来的想法很朴素，我最终的矩阵多大，我就开多少个线程去算，一个线程对应最终矩阵C的一个值。

那么这个线程去算的时候呢，要看一下矩阵A的一行，再看矩阵B的一列。最终进行乘法、相加运算。C中的每个元素的线程都要去看A和看B，那么就会带来大量的对于Global Memory的访问，这会十分低效。

直觉上打一个不恰当的比方，我们给一个大房间铺地砖的时候，应该是拿个小推车去从大货车上装个一批，然后在房间中从这个小推车上拿了地砖去铺，小推车拿空了小推车再去装一批。而不是每铺一个地砖都要去大货车中拿一块地砖。

我们要做的是，找的一种办法提升shared memory的使用，减少Global Memory的访问，这种办法可以记录某些中间状态，让矩阵C的每个元素计算不至于总是去查A和、B

图片出自Cuda 编程之 Tiling

Say, we have matrix $A$

$$
\left[ \begin{array}{cc}
2 & 6 & 7 &5 \
3 & 1 & 4 &6 \
8 & 9 & 0 &1 \
2 & 7 & 7 &4
\end{array}
\right]
$$

matrix $B$

$$
\left[ \begin{array}{cc}
1 & 6 & 2 &1 \
3 & 9 & 8 &4 \
5 & 6 & 3 &9 \
1 & 0 & 7 &2
\end{array}
\right]
$$

__global__ void MatrixMultiplyKernel(const float* devM, const float* devN,
                                     float* devP, const int width)
{
    __shared__ float sM[TILE_WIDTH][TILE_WIDTH];
    __shared__ float sN[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; 		int by = blockIdx.y;
    int tx = threadIdx.x;		int ty = threadIdx.y;

    int col = bx * width + tx;
    int row = by * width + ty;

    // Initialize accumulator to 0
    float pValue = 0;

    // Multiply and add
    for (int m = 0; m < width / TILE_WIDTH; m++) {
        sM[ty][tx] = devM[row * width + (m * TILE_WIDTH + tx)];
        sN[ty][tx] = devN[col + (m * TILE_WIDTH + ty) * Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            pValue += sM[ty][k] * sN[k][tx];
        __syncthreads();
    }
    devP[row * width + col] = pValue;
}

for the first iteration, we have

Phase 1: Processing the First Pair of Tiles

Load Tiles into Shared Memory

matrix $sM$

$$
\left[ \begin{array}{cc}
2 & 6 \
3 & 1\
\end{array}
\right]
$$

matrix $sN$

$$
\left[ \begin{array}{cc}
1 & 6 \
3 & 9\
\end{array}
\right]
$$

`__syncthreads();`

Compute from Shared Memory

Our target thread (0,0) now calculates its part of the dot product using the values in the fast shared memory.

1
2
3

pValue += sM[0][0] * sN[0][0] + sM[0][1] * sN[1][0]
pValue += (2 * 1) + (6 * 3)
pValue = 20

Our target thread (0,1):

1
2
3

pValue += sM[0][0] * sN[0][1] + sM[0][1] * sN[1][1]
pValue += (2 * 6) + (6 * 9)
pValue = 66

Our target thread (1,0):

1
2
3

pValue += sM[1][0] * sN[0][0] + sM[1][1] * sN[1][0]
pValue += (3 * 1) + (1 * 3)
pValue = 6

Our target thread(1,1):

1
2
3

pValue += sM[1][0] * sN[0][1] + sM[1][1] * sN[1][1]
pValue += (3 * 6) + (1 * 9)
pValue = 27

`__syncthreads();`

all four of those calculations are done at the same time, in parallel.

Phase 2: Processing the Second Pair of Tiles

Load Tiles into Shared Memory

matrix $sM$

$$
\left[ \begin{array}{cc}
7 & 5 \
4 & 6\
\end{array}
\right]
$$

matrix $sN$

$$
\left[ \begin{array}{cc}
5 & 6 \
1 & 0\
\end{array}
\right]
$$

`__syncthreads();`

Compute from Shared Memory

Our target thread (0,0) now calculates its part of the dot product using the values in the fast shared memory.

pValue += sM[0][0] * sN[0][0] + sM[0][1] * sN[1][0]
pValue += (7 * 5) + (5 * 1)
pValue += 40

pValue = 60

Our target thread (0,1):

pValue += sM[0][0] * sN[0][1] + sM[0][1] * sN[1][1]
pValue += (7 * 6) + (5 * 0)
pValue += 42

pValue = 108

Our target thread (1,0):

pValue += sM[1][0] * sN[0][0] + sM[1][1] * sN[1][0]
pValue += (4 * 5) + (6 * 1)
pValue += 26

pValue = 32

Our target thread(1,1):

pValue += sM[1][0] * sN[0][1] + sM[1][1] * sN[1][1]
pValue += (4 * 6) + (6 * 0)
pValue += 24

pValue = 51

`__syncthreads();`

all four of those calculations are done at the same time, in parallel.

Now, we have the left top block of C:

$$
\left[ \begin{array}{cc}
60 & 108 \
32 & 51\
\end{array}
\right]
$$

为什么我们可以这么做？

Dot product can be done as partial nums

在naive中，我们想求一个C的元素，我们一次性去拿A的对应横行和B的对应纵行进行计算。在tiling中，我们把这个计算分为多步，对于每个我们关注的C的子矩阵，即一个block，比如例子这里就是left top block of C，我们去看原矩阵A、B对应的block，而后进行运算。这会让我们不再每求一个C的元素都要将A的对应横行和B的对应纵行全部访问一遍，而是，我关注的这个C的block在原A矩阵中的对应（横向）block 以及原B矩阵中的对应（竖向）的block块A,B,C, 做累加操作。每次我都会先缓存以下当前关注的A、B中的block，称作sM、sN
以下是block同步开始的
top left block of C
drawio
drawio

top second left block of C
drawio
drawio

Posted 2025-08-27Updated 2025-09-04Penn / 25 Fall / CIS 5650 GPU Programming / Lectures10 minutes read (About 1510 words)

Cpp & Basic Concepts

CPU side serial code controls(send commands to) CPU side parallel code

初始化数据（在CPU内存中）
分配GPU内存
将数据从CPU内存拷贝到GPU内存
启动GPU上的核函数（Kernal）
等待GPU计算完成
将计算结果从GPU内存拷贝到CPU内存
释放GPU和CPU内存

CUDA 函数执行空间限定符

限定符	执行位置	调用位置
`__global__`	设备(GPU)	主机(CPU)
`__device__`	设备	设备
`__host__`	主机	主机

kernals are running on the GPU, so we use pointers to access memory

`global`

1	__global__ void myKernel()

must return void
如果需要返回结果，必须通过传入指针，让核函数将结果写入GPU内存中
使用一种特殊的 <<<...>>> 执行配置语法来调用，例如 myKernal<<<grid, block>>>(args...);

`device`

1	__device__ float myDeviceFunction()

这是一个只能在GPU上执行，并且也只能被其他 __global__ 或 __device__ 函数调用的函数。它通常用于在核函数中实现一些可重用的辅助功能，类似于普通C++代码中的普通函数。
Inlined by default

for all device code

No static variables
- lifetime of the program
- on the gpu there is no lifetime concept
No malloc()
- never
- many of threads tring to allocate, not enough cache to do that
- compiler allows, but performance issue
- GPU内存最好由主机端统一管理。标准的做法是在主机端使用 cudaMalloc() 分配一大块内存，然后将指向这块内存的指针传递给核函数。GPU线程在这个预先分配好的内存区域中进行读写操作。

`host`

1	__host__ int myHostFunction()

这就是一个普通的C/C++函数，在CPU上执行，也只能被CPU调用。如果不写任何限定符，函数默认就是 __host__

组合用法

__device__ __host__ void func() VALID
这意味着这个函数被编译了两次：一个版本用于在CPU上调用，另一个版本用于在GPU上调用。这在我们希望CPU和GPU共享某个工具函数（例如一个简单的数学计算）时非常有用。

__global__ __host__ void func() INVALID
这个组合在逻辑上是矛盾的。__global__ 的核心定义是“从CPU调用，在GPU执行”，而 __host__ 的定义是“从CPU调用，在CPU执行”。一个函数不能同时满足这两种执行模式，因此编译器禁止这种组合。

if __global__, it is only __global__ nothing else

pow, sqrt, exp, sin, cos
__powf, __sinf, __logf, __exp

Grid, Block, Thread

这些都是抽象的概念，并非是真是的物理结构，区分这些概念是为了更高效地编程

让不同的线程，拿不同的数据，进行相同的运算

gridDim 与 blockDim 都是dim3

上图中的, 表示1D Grid，其在x方向上有4个Block，而其他方向上都只有1个Block （默认1个）

1	dim3 gridDim(4);

一般的，我们需要设置 block 中的每个维度的线程数为 32 的整数倍

Thread

执行计算的最基本单元。每个线程都会完整地执行一遍核函数 (__global__函数) 的代码。

Treading Model

Real world limitations
- No. of cores Cores/ Transistors for memory
- Power
- Scheduling

Thread Hierarchies

在 __global__ 函数内部，可以直接使用一些内置的只读变量来确定当前线程的位置：
dim3 gridDim 网格的维度
dim3 blockDim 线程块的维度
uint3 blockIdx 当前线程所在Block在Grid中的索引
uint3 threadIdx 当前线程在Block中的索引

1D: Thread ID == Thread Index
2D with size (Dx, Dy)
Thread ID of index (x, y) == x + y Dx
3D with size (Dx, Dy, Dz)
Thread ID of index (x, y, z) == x + y Dx + z Dx Dy

// Kernel function to add two matrices
__global__ void matrixAddition(const float* A, const float* B, float* C, int width, int height)
{
    // Calculate the global row and column index
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Check if the thread is within the matrix bounds
    if (col < width && row < height) {
        int idx = row * width + col;
        C[idx] = A[idx] + B[idx];
    }
}

int main()
{
    int width = 1024;
    int height = 1024;
    int numElements = width * height;
    size_t size = numElements * sizeof(float);

    // ... (此处省略了为 A, B, C 分配主机和设备内存，以及数据拷贝的代码) ...
    // ... cudaMalloc, cudaMemcpy, etc. ...

    // Define the dimensions of the thread block
    // 通常选择一个二维的块，例如 16x16 或 32x32
    dim3 threadsPerBlock(16, 16);

    // Calculate the dimensions of the grid
    // 向上取整，确保有足够的block覆盖整个矩阵
    dim3 gridDim( (width + threadsPerBlock.x - 1) / threadsPerBlock.x,
                  (height + threadsPerBlock.y - 1) / threadsPerBlock.y );

    std::cout << "Launching Kernel with Grid: (" << gridDim.x << ", " << gridDim.y << "), Block: (" << threadsPerBlock.x << ", " << threadsPerBlock.y << ")" << std::endl;

    // Launch the kernel
    matrixAddition<<<gridDim, threadsPerBlock>>>(A_d, B_d, C_d, width, height);
    
    // ... (此处省略了将结果从C_d拷贝回C_h，以及释放内存的代码) ...
    // ... cudaMemcpy, cudaFree, free ...

    return 0;
}

Block

块内的线程可以相互协作，例如通过共享内存 (Shared Memory) 快速交换数据，也可以进行同步 (__syncthreads())。

一个块内的所有线程必须在同一个流式多处理器 (Streaming Multiprocessor, SM) 上执行

一个 Block 至多可以容纳 1024 个 Thread，这是自开普勒架构（GTX 10系列）显卡之后的规范。

但是，1024个线程并不意味着同一时间会执行1024个，最小的执行单位是Warp

1	dim3 threadsPerBlock(16, 16);

This line declares that each thread block will be a 2D grid containing 16 x 16 = 256 threads. You are creating a small, square team of threads. Here, threadsPerBlock.x is 16 and threadsPerBlock.y is 16. The z-dimension is 1 by default.

需要注意的是，这里的图片是Scalability的实例，并不是说每次SM就处理一个Block。在同一个Block上运行的threads 确实都在同一个SM上，也就意味着其都可以与L1通信，但是，通过这种办法来达成Block间的通信是不安全的。

不同块之间的线程是无法直接通信和同步的。

Grid

一组线程块的集合。
一个核函数的所有线程都组织在一个网格中

Posted 2025-05-25Updated 2025-06-08Graphics / Math40 minutes read (About 5926 words)

Curves & Spline

Refer to: The Continuity of Splines by Freya Holmér

Bernstein

此处，$B_0^{3}$ 等亦可写作 $B_{0,3}(u) = C(3,0) \cdot u ^{0} \cdot (1-u)^{3}$

DeCasteljau

Polynomial Coefficients
$$
f(u) = b_0 + u(-3b_0 + 3b_1) + u^{2}(3b_0 -6b_1 + 3b_2) + u^{3}(-b_0 + 3b_1 -3b_2 + b_3)
$$
Efficient Evaluation using Horner’s Method：
$$
f(u) = C_0 + u(C_1 + u(C_2 + uC_3))
$$

Subdivision of Bezier Curve

Derivatives of Bezier Curves

For the case where u = 0

when u = 1

Why Spline？

如果需要我们对一条曲线有更多的控制能力，我们可以考虑提高贝塞尔曲线的次数，例如，一个6次贝塞尔曲线会有7个控制点，其形如：

$$
X(s) = (1-s)^6 P_0 + 6s(1-s)^5 P_1 + 15s^2(1-s)^4 P_2 + 20s^3(1-s)^3 P_3 + 15s^4(1-s)^2 P_4 + 6s^5(1-s) P_5 + s^6 P_6
$$

但是，移动每个控制点的时候，整个曲线都会受到影响，计算也会变得更加复杂。

为了更稳定地计算并获得更好地局部控制线和直观性，我们会将多段较低阶地贝塞尔曲线连接起来。

三种切线

$chrome\_vcPTjok4lE.gif$

Continuity

$C^{n}$ continuous if $A^{(i)}(t_{end}) = B^{(i)}(t_{start})$ for $i$ in $ {0,…,n}$

观察下图，可以看到速度并不是连续的，也就是不满足 $C^1$ 连续性。即使原 spline 在 joint 处的两个控制点形成的两个切线已经 aligned (方向相同，但是大小不同)。

$chrome\_Lda92InYAp.gif$

当控制点关于 joint 点 mirror 时， $C^1$连续。

$chrome\_sTLPFaDnwu.gif$

如何构造出 $C^1$ 呢？Mirror 意味着怎么样的关系呢？

我们对一段贝塞尔曲线求导，有

$$
f’(u) = -3b_0 + 3b_1 + 2u(3b_0 -6b_1 + 3b_2) + 3u^2(-b_0 + 3b_1 -3b_2 + b_3)
$$

那么，当$u_0 = 1$时（第一段曲线的右端点）

$$
f’(1) = -3b_2 + 3b_3
$$

当$u_1 = 0$ （第二段曲线的左端点）

$$
f’(0) = -3b_3 + 3b_4
$$

当$f’(1) =f’(0)$时，有

$$
b_4 = 2b_3 - b_2
$$

也就是说，当 Mirror 时，$b_4$ 的位置已经被 $b_2$ 与 $b_3$ 完全决定，不再能被用户自由控制。

$C^2$ 连续性？

$chrome\_Xx82K9kfO3.gif$
加速度改变速度。

我们对上面提到的速度再求一次导，有

$$
f’’(u) = 6b_0 -12b_1 + 6b_2 + 6u(-b_0 + 3b_1 -3b_2 + b_3)
$$

那么，当$u_0 = 1$时（第一段曲线的右端点）

$$
f’’(1) = 6b_1 - 12b_2 +6 b_3
$$

当$u_1 = 0$ （第二段曲线的左端点）

$$
f’’(0) = 6b_3 -12b_4 + 6b_5
$$

当$f’’(1) =f’(0)$时，有

$$
b_5 = b_1 - 2 b_2 + 2b_4
$$

而$C^{1}$ =又保证了，$b_4$ 已经完全被$b_3$ 与$b_2$ 控制

$$
b_4 = 2b_3 - b_2
$$

所以在$C^{2}$下有

$$
b_5 = b_1 + 4(b_3 - b_2)
$$

也就是说，当 $C^{2}$ 时，$b_5$ 的位置已经被 $b_1$ $b_2$ 与 $b_3$ 完全决定，不再能被用户自由控制。
$chrome\_WDuM7jIPbn.gif$
对于多段贝塞尔曲线构成的spline，保证$C^2$时，很多控制点的位置都已经被完全决定，控制能力变弱了，并且，微小的控制会导致后面剧烈的变化。

Geometric Continuity

$A(t)$ and $B(t)$ are
$G^n$ ccontinuous if a function $g(t)$ exists so that $A(t)$ and $B(g(t))$ are $C^{n}$ continuous

$G^{1}$

$chrome\_5eyM5Uy5jo.gif$
切线连续性通常用 G¹ 连续性 (Geometric Continuity, first-order) 来表示。为了让两条曲线在连接点处达到 G¹ 连续，必须满足以下两个条件：

G⁰ 位置连续性 (Positional Continuity): 两条曲线的端点必须在连接点处重合。也就是说，它们必须接触。

切线方向相同: 在连接点处，两条曲线的切线向量必须指向相同的方向（或者完全相反的方向，但在实际应用中通常要求方向一致）。这意味着两条曲线在该点的变化趋势是一致的。
值得注意的是，G¹ 连续性只要求切线的方向相同，而不要求切线向量的大小（即参数导数的大小）也相同。如果切线向量的大小和方向都相同，那么就达到了更高级别的 $C^1$ 参数连续性 (Parametric Continuity, first-order)。 $C^1$ 连续性是比 G¹ 更严格的条件

满足 $G^{1}$ 时，有

$$
b_4 = b_3 + (b_3 - b_2) \beta
$$

即，共线同向，但不要求大小相同。

$chrome\_cdfzZ0lzIx.gif$

在曲面上，我们可以注意到反光有条缝
$chrome\_ujR8IalkRS.gif$

$chrome\_L4eUaUYzGF.gif$

Curvature $k$

$$ \kappa = \frac{\begin{vmatrix} P'_{x} & P''_{x} \\\\ P'_{y} & P''_{y} \end{vmatrix}}{{\Vert \mathbf{P}' \Vert}^3} $$

$chrome\_HRGJI65w0F.gif$

仅仅$C^{0}$时，也可能是Curvature连续的，而$C^{1}$时，未必连续。

$G^{2}$ 连续性

$$
\mathbf{P}_5 = \mathbf{P}_3 + (\mathbf{P}_3 - \mathbf{P}_2)(2\beta_1 + \beta_1^2 + \beta_2/2) + (\mathbf{P}_1 - \mathbf{P}_2)\beta_1^2
$$

$\mathbf{P}_5$ 被上述控制点与参数共同决定。
$chrome\_Bp9R9dMuIx.gif$

$G^{2}$ 以及以上被称为A级面

*as long as the curves are regular

Hermite Curves

贝塞尔曲线利用四个控制点来构造，而Hermite曲线由两个端点和端点的速度。即

$$
P(0) = P_0 \\
P(1) = P_1 \\
P’(0) = \mathbf{v}_0 \\
P’(1) = \mathbf{v}_1
$$

通过上述关系求解

$$
P(t) = at^3 + bt^2 + ct + d
$$

得到

$$
\mathbf{P}(t) =
\begin{bmatrix}
1 & t & t^2 & t^3
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
-3 & -2 & 3 & -1 \\
2 & 1 & -2 & 1
\end{bmatrix}
\begin{bmatrix}
\mathbf{P}_0 \\
\mathbf{v}_0 \\
\mathbf{P}_1 \\
\mathbf{v}_1
\end{bmatrix}
$$

通过设置连接点处切向量相等，可以方便地实现$C^1$

Hermite to Bezier:

$$
B_0 = P_0 \\
B_1 = P_0 + v_0/3 \\
B_2 = P_1 - v_1/3 \\
B_3 = P_1
$$

Cardinal Spline

从此处开始$ \mathbf{P}_0$，$ \mathbf{P}_1$，$ \mathbf{P}_2$，$ \mathbf{P}_3$ 指的是spline上的相邻四个joints点

如何串联一系列给定的点呢？
我们可以采用一种类似Hermite的方法，我们计算一个joint相邻两点的位移，并以此作为此点的速度
$chrome\_XaGsOtgDjK.gif$

但首尾两点并没有被穿进去，我们可以将原第二个点关于首个joint对称过去。

但是这样得到的曲线看起来有些呆板。出现了长段的较为平坦的线条，我们可以引入一个 Scale 参数 $s$ 来缩放我们的速度。当$s$接近于0的时候，也就退化成了线性spline

$$
\mathbf{P}(t) =
\begin{bmatrix}
1 & t & t^2 & t^3
\end{bmatrix}
\begin{bmatrix}
0 & 1 & 0 & 0 \\
-s & 0 & s & 0 \\
2s & s-3 & 3-2s & -s \\
-s & 2-s & s-2 & s
\end{bmatrix}
\begin{bmatrix}
\mathbf{P}_0 \\
\mathbf{P}_1 \\
\mathbf{P}_2 \\
\mathbf{P}_3
\end{bmatrix}
$$

当$s = 0.5$时，我们有

Catmull-Rom Spline

Interpolating every point!
$C^1$ continuous， not $C^{2}$
No need to specify tangents

$$
\mathbf{P}(t) = \begin{bmatrix} 1 & t & t^2 & t^3 \end{bmatrix} \frac{1}{2} \begin{bmatrix} 0 & 2 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ 2 & -5 & 4 & -1 \\ -1 & 3 & -3 & 1 \end{bmatrix} \begin{bmatrix} \mathbf{P}_0 \\ \mathbf{P}_1 \\ \mathbf{P}_2 \\ \mathbf{P}_3 \end{bmatrix}
$$

B-Spline(Basis-Spline)

那么，有没有一种构造Spline的方式，使之可以满足$C^2$呢？也就是说，我们要对这个求解：

$$
\mathbf{P}(t) =
\begin{bmatrix}
1 & t & t^2 & t^3
\end{bmatrix}
\begin{bmatrix}
\textcolor{red}{c_0} & \textcolor{cyan}{c_1} & \textcolor{green}{c_2} & \textcolor{orange}{c_3} \\
\textcolor{red}{c_4} & \textcolor{cyan}{c_5} & \textcolor{green}{c_6} & \textcolor{orange}{c_7} \\
\textcolor{red}{c_8} & \textcolor{cyan}{c_9} & \textcolor{green}{c_{10}} & \textcolor{orange}{c_{11}} \\
\textcolor{red}{c_{12}} & \textcolor{cyan}{c_{13}} & \textcolor{green}{c_{14}} & \textcolor{orange}{c_{15}}
\end{bmatrix}
\begin{bmatrix}
\textcolor{red}{\mathbf{P}_0} \\
\textcolor{cyan}{\mathbf{P}_1} \\
\textcolor{green}{\mathbf{P}_2} \\
\textcolor{orange}{\mathbf{P}_3}
\end{bmatrix}
$$

$$
\begin{gather}
\textcolor{red}{B_0(1) = 0} \tag{1} \
\textcolor{red}{B’{0}(1) = 0} \tag{2} \
\textcolor{red}{B’’{0}(1) = 0} \tag{3} \
\textcolor{red}{B_0(0)} = \textcolor{cyan}{B_1(1)} \tag{4} \
\textcolor{red}{B’{0}(0)} = \textcolor{cyan}{B’{1}(1)} \tag{5} \
\textcolor{red}{B’’{0}(0)} = \textcolor{cyan}{B’’{1}(1)} \tag{6} \
\textcolor{cyan}{B_1(0)} = \textcolor{green}{B_2(1)} \tag{7} \
\textcolor{cyan}{B’{1}(0)} = \textcolor{green}{B’{2}(1)} \tag{8} \
\textcolor{cyan}{B’’{1}(0)} = \textcolor{green}{B’’{2}(1)} \tag{9} \
\textcolor{green}{B_2(0)} = \textcolor{orange}{B_3(1)} \tag{10} \
\textcolor{green}{B’{2}(0)} = \textcolor{orange}{B’{3}(1)} \tag{11} \
\textcolor{green}{B’’{2}(0)} = \textcolor{orange}{B’’{3}(1)} \tag{12} \
\textcolor{orange}{B_3(0) = 0} \tag{13} \
\textcolor{orange}{B’{3}(0) = 0} \tag{14} \
\textcolor{orange}{B’’{3}(0) = 0} \tag{15} \
\textcolor{red}{B_0(t)} + \textcolor{cyan}{B_1(t)} + \textcolor{green}{B_2(t)} + \textcolor{orange}{B_3(t)} = 1 \tag{16}
\end{gather}
$$

$chrome\_s6bGCm2Kqa.gif$

$$
\mathbf{P}(t) =
\begin{bmatrix}
1 & t & t^2 & t^3
\end{bmatrix}
\frac{1}{6}
\begin{bmatrix}
1 & 4 & 1 & 0 \
-3 & 0 & 3 & 0 \
3 & -6 & 3 & 0 \
-1 & 3 & -3 & 1
\end{bmatrix}
\begin{bmatrix}
\textcolor{red}{\mathbf{P}_0} \
\textcolor{cyan}{\mathbf{P}_1} \
\textcolor{green}{\mathbf{P}_2} \
\textcolor{orange}{\mathbf{P}_3}
\end{bmatrix}
$$

Not interpolating
$C^2$ continuous! $G ^ 2$

B 样条的递推定义

refer to B样条的定义de boor-cox公式的理解

数学定义：

$$
P(t) = \sum_{i=0}^{n} P_i F_{i,k}(t) \qquad t \in [t_{k-1}, t_{n+1}]
$$

$k$ 是 B 样条的次数
$n$ 是控制点的最大索引，如果有$P_{i}, (i = 0, 1, …, n)$共 $n + 1$个控制点，那么最大索引就是$n$
$i$ 是这些控制点的序号

de Boor-Cox 递推定义，只与 t 有关

基函数的递推：

$$
\left{
\begin{aligned}
& F_{i,0}(t) =
\begin{cases}
1, & t_i \le t < t_{i+1} \
0, & \text{其他}
\end{cases}
\[2ex]
& F_{i,k}(t) = \frac{t-t_i}{t_{i+k}-t_i} F_{i,k-1}(t) + \frac{t_{i+k+1}-t}{t_{i+k+1}-t_{i+1}} F_{i+1,k-1}(t)
\[2ex]
& \text{约定 } \frac{0}{0} = 0
\end{aligned}
\right.
$$

$i$ 是基函数的编号，每个$P_i$都有一个权重函数，也就是基函数。

$i$ 也是向量节点 $T$ 中节点的索引。基函数$F_{i,k}(t)$的具体形状，它在时间轴上的起点和终点，完全是由节点向量$T = [t_0, t_1, t_2, … ]$ 决定的。$i$ 告诉我们，基函数 $F_{i,k}(t)$ 在 $[t_i, t_{i+k+1}]$ 是不为零的。

控制点的递推：

$$
\begin{equation*}
\mathbf{P}{i,r}(t) = (1 - \alpha{i,r}) \mathbf{P}{i-1, r-1}(t) + \alpha{i,r} \mathbf{P}_{i, r-1}(t) \tag{}
\end{equation*}
$$

$$
\alpha_{i,r} = \frac{t - t_i}{t_{i+k+1-r} - t_i}
$$

$k$ 是 B 样条的次数，那么这个样条的阶数是$k+1$，一段就有$k+1$个控制点

$r$ 是计算的轮数，$r$从1走到$k$

我们把所有的（不仅是某一段的）, 共(n + 1)个控制点记作 $P_{i}, (i = 0, 1, …, n)$ ，有 n + k +2 ($节点数 = 总控制点数 + 次数 + 1$) 个节点矢量 $T = [t_0, t_1, … t_{n + k + 1}]$

例如，5 个控制点，3次的B样条曲线，由于一段是由4个控制点决定的，那么就有（5-3 = ）2个线段，1个joint

对于零次B样条：

其基函数可以直接得到

对于一次B样条：

$$
F_{i,1}(t) = \frac{t - t_i}{t_{i+1} - t_i} F_{i,0}(t) + \frac{t_{i+2} - t}{t_{i+2} - t_{i+1}} F_{i+1,0}(t)
$$

$$
F_{i+1,0}(t) =
\begin{cases}
1, & \text{若 } t_{i+1} \le t < t_{i+2} \[1.5ex]
0, & \text{其他}
\end{cases}
$$

对于二次 B 样条：

$$
F_{i,2}(t) = \frac{t-t_i}{t_{i+2}-t_i} F_{i,1}(t) + \frac{t_{i+3}-t}{t_{i+3}-t_{i+1}} F_{i+1,1}(t)
$$

$$
F_{i+1,1}(t) =
\begin{cases}
\frac{t-t_{i+1}}{t_{i+2}-t_{i+1}}, & \text{若 } t_{i+1} \le t < t_{i+2} \[2ex]
\frac{t_{i+3}-t}{t_{i+3}-t_{i+2}}, & \text{若 } t_{i+2} \le t < t_{i+3} \[2ex]
0, & \text{其他}
\end{cases}
$$

对于三次 B 样条：

$$
F_{i,3}(t) = \frac{t - t_i}{t_{i+3} - t_i} F_{i,2}(t) + \frac{t_{i+4} - t}{t_{i+4} - t_{i+1}} F_{i+1,2}(t)
$$

$$
F_{i+1,2}(t) =
\begin{cases}
\frac{t - t_{i+1}}{t_{i+3} - t_{i+1}} \cdot \frac{t - t_{i+1}}{t_{i+2} - t_{i+1}}, & \text{若 } t_{i+1} \le t < t_{i+2} \[3ex]
\frac{t-t_{i+1}}{t_{i+3}-t_{i+1}} \cdot \frac{t_{i+3}-t}{t_{i+3}-t_{i+2}} + \frac{t_{i+4}-t}{t_{i+4}-t_{i+2}} \cdot \frac{t-t_{i+2}}{t_{i+3}-t_{i+2}}, & \text{若 } t_{i+2} \le t < t_{i+3} \[3ex]
\frac{(t_{i+4}-t)^2}{(t_{i+4}-t_{i+2})(t_{i+4}-t_{i+3})}, & \text{若 } t_{i+3} \le t < t_{i+4} \[3ex]
0, & \text{其他}
\end{cases}
$$

例

三次 B 样条, 控制点为$P_0 = (30, 0)$, $P_1 = (60, 10)$, $P_2 = (80, 30)$, $P_3 = (90, 60)$, $P_4 = (90, 90)$，节点向量(Knot Vector): $T = [t_0, t_1, t_2, t_3,t_4,t_5,t_6,t_7, t_8] = [0,0,0,0,0.5, 1,1,1,1]$
求曲线上一点 $P(\frac{1}{4})$ 的值

首先观察这个点在哪个区间上，由于$t_3 = 0 < \frac{1}{4} < t_4 = 0.5$，那么这个区间的索引是 $i = 3$

在区间$[t_i, t_{i+1})$内，对于$k$次B样条，需要用到$k + 1$个控制点，这些点的索引是$P_{i-k}$, $P_{i-k +1}, … P_{i}$

那么，在区间$[t_3, t_{4})$ 内，对于3次B样条，需要用到4个控制点，这些点的索引是$P_0, P_1, P_2, P_3$

对于 k 次B样条，需要经行 k 轮运算，每一轮都使用上一轮计算出的点进行插值。

回顾数学定义：

$$
P(t) = \sum_{i=0}^{n} P_i F_{i,k}(t)
$$

$$
P(\frac{1}{4}) = \sum_{i=0}^{3} P_i F_{i,3}(\frac{1}{4}) \
P(\frac{1}{4}) = P_0F_{0,3}(\frac{1}{4}) + P_1 F_{1,3}(\frac{1}{4}) + P_2 F_{2,3}(\frac{1}{4}) + P_3F_{3,3}(\frac{1}{4})
$$

据此，利用上述的关于$F_{i,k}$的递推关系，我们可以算出各个$F_{i,k}$ 的值，进而$P(\frac{1}{4})$的值。不过，基于$F_{i,k}$的递推关系，我们可以找到一些最终值和控制点的关系：

$$
\begin{equation*}
\mathbf{P}{i,r}(t) = (1 - \alpha{i,r}) \mathbf{P}{i-1, r-1}(t) + \alpha{i,r} \mathbf{P}_{i, r-1}(t) \tag{}
\end{equation*}
$$

$$
\alpha_{i,r} = \frac{t - t_i}{t_{i+k+1-r} - t_i}
$$

第一轮计算，也就是 $r = 1$：

计算 $P_{1,1}$ ，由 $P_{0,0}$ 和 $P_{1,0}$ 插值得到

权重 $\alpha_{1,1} = \frac{t - t_1}{t_{1+3+1-1} - t_1} = \frac{0.25 - t_1}{t_4 - t_1} = \frac{0.25 - 0}{0.5 - 0} = 0.5$
$P_{1,1} = (1 - 0.5)P_{0,0} + 0.5 \cdot P_{1,0} = 0.5(30,0) + 0.5(60,10) = (15,0) + (30,5) = (45, 5)$

计算 $P_{2,1}$ (由 $P_{1,0}$ 和 $P_{2,0}$ 插值得到):

权重 $\alpha_{2,1} = \frac{t - t_2}{t_{2+3+1-1} - t_2} = \frac{0.25 - t_2}{t_5 - t_2} = \frac{0.25 - 0}{1 - 0} = 0.25$
$P_{2,1} = (1 - 0.25)P_{1,0} + 0.25 \cdot P_{2,0} = 0.75(60,10) + 0.25(80,30) = (45, 7.5) + (20, 7.5) = (65, 15)$

计算 $P_{3,1}$ (由 $P_{2,0}$ 和 $P_{3,0}$ 插值得到):

权重 $\alpha_{3,1} = \frac{t - t_3}{t_{3+3+1-1} - t_3} = \frac{0.25 - t_3}{t_6 - t_3} = \frac{0.25 - 0}{1 - 0} = 0.25$
$P_{3,1} = (1 - 0.25)P_{2,0} + 0.25 \cdot P_{3,0} = 0.75(80,30) + 0.25(90,60) = (60, 22.5) + (22.5, 15) = (82.5, 37.5)$

第二轮计算，也就是 $r = 2$：
计算 $P_{2,2}$ (由 $P_{1,1}$ 和 $P_{2,1}$ 插值得到):
权重 $\alpha_{2,2} = \frac{t - t_2}{t_{2+3+1-2} - t_2} = \frac{0.25 - t_2}{t_4 - t_2} = \frac{0.25 - 0}{0.5 - 0} = 0.5$
$P_{2,2} = (1 - 0.5)P_{1,1} + 0.5 \cdot P_{2,1} = 0.5(45,5) + 0.5(65,15) = (22.5, 2.5) + (32.5, 7.5) = (55, 10)$

计算 $P_{3,2}$ (由 $P_{2,1}$ 和 $P_{3,1}$ 插值得到):
权重 $\alpha_{3,2} = \frac{t - t_3}{t_{3+3+1-2} - t_3} = \frac{0.25 - t_3}{t_5 - t_3} = \frac{0.25 - 0}{1 - 0} = 0.25$
$P_{3,2} = (1 - 0.25)P_{2,1} + 0.25 \cdot P_{3,1} = 0.75(65,15) + 0.25(82.5, 37.5) = (48.75, 11.25) + (20.625, 9.375) = (69.375, 20.625)$

第三轮计算 ($r = 3$)，最终结果
计算 $P_{3,3}$ (由 $P_{2,2}$ 和 $P_{3,2}$ 插值得到):
权重 $\alpha_{3,3} = \frac{t - t_3}{t_{3+3+1-3} - t_3} = \frac{0.25 - t_3}{t_4 - t_3} = \frac{0.25 - 0}{0.5 - 0} = 0.5$
$P_{3,3} = (1 - 0.5)P_{2,2} + 0.5 \cdot P_{3,2} = 0.5(55,10) + 0.5(69.375, 20.625)$
$P_{3,3} = (27.5, 5) + (34.6875, 10.3125) = (62.1875, 15.3125)$

Trajectory

$$
\begin{bmatrix} 1 & 0 & 0 & 0 \ 0 & 1! & 0 & 0 \ 0 & 0 & 1/2! & 0 \ 0 & 0 & 0 & 1/3! \end{bmatrix} \begin{bmatrix} \mathbf{p}_0 \ \mathbf{v}_0 \ \mathbf{g}_0 \ \mathbf{j}_0 \end{bmatrix}
$$

$$
\mathbf{P}(t) = \begin{bmatrix} 1 & t & t^2 & t^3 \end{bmatrix} \begin{bmatrix} \mathbf{p}_0 \ \mathbf{v}_0 \ \frac{1}{2}\mathbf{g}_0 \ \frac{1}{6}\mathbf{j}_0 \end{bmatrix} = \mathbf{p}_0 \cdot 1 + \mathbf{v}_0 \cdot t + \left(\frac{1}{2}\mathbf{g}_0\right) \cdot t^2 + \left(\frac{1}{6}\mathbf{j}_0\right) \cdot t^3
$$

Conclusion

$chrome\_LUajWPqZNo.gif$

	Deg.	Cont.	Tangents	Interpol.	Use cases
Bézier	3	C⁰/C¹	manual	some	shapes, fonts & vector graphics
Hermite	3	C⁰/C¹	explicit	all	animation, physics sim & interpolation
Catmull-Rom	3	C¹	auto	all	animation & path smoothing
B-Spline	3	C²	auto	none	curvature-sensitive shapes & animations, such as camera paths
Linear	1	C⁰	auto	all	dense data & interpolation where smoothness doesn’t matter

Handling Non-Uniform Time in Splines

目前讨论的是最经典的情况，即每个曲线段的“时长”都是1个单位（称之为**参数化均匀 (Uniform Parameterization)**）。

当每个曲线段的实际时间或“长度”不同时（参数化非均匀, Non-Uniform Parameterization），我们需要对连续性的定义进行一些调整。核心思想是：我们关心的连续性（如速度、加速度）是相对于真实时间 T 的，而不是曲线的局部参数 u 的。

1. 核心变化：全局时间 (T) vs 局部参数 (u)

首先，我们要区分两个概念：

局部参数 u：这始终是从0到1变化的值，用来在单个贝塞尔曲线段内部进行插值。你的笔记中所有的 f(u) 都是基于此。
**全局时间 T**：这是整个样条的“真实”时间轴。

假设第 i 段曲线的开始时间是 Tᵢ，结束时间是 Tᵢ₊₁。那么该段的持续时间是 ΔTᵢ = Tᵢ₊₁ - Tᵢ。
我们可以将全局时间 T 转换为该段的局部参数 u：
$u = \frac{T - T_i}{T_{i+1} - T_i} = \frac{T - T_i}{\Delta T_i}$

2. 关键：使用链式法则 (Chain Rule) 重新定义连续性

物理上的速度和加速度是曲线 P 对全局时间 T 的导数，而不是对 u 的导数。我们需要使用链式法则来连接它们：

一阶导数 (速度)

$\frac{dP}{dT} = \frac{dP}{du} \cdot \frac{du}{dT}$
从上面的公式我们知道 du/dT = 1 / ΔTᵢ。所以：
$\mathbf{P}’(T) = \frac{1}{\Delta T_i} \mathbf{P}’(u)$
这意味着，真实的速度向量是 P'(u) 的一个缩放版本，缩放因子是该段时间间隔的倒数。

二阶导数 (加速度)

对一阶导数再次求导：

$$
\frac{d^2P}{dT^2} = \frac{d}{dT}\left( \frac{1}{\Delta T_i} \frac{dP}{du} \right) = \frac{1}{\Delta T_i} \cdot \frac{d}{dT}\left(\frac{dP}{du}\right)
$$

再次使用链式法则：

$$
\mathbf{P}’’(T) = \frac{1}{\Delta T_i} \cdot \frac{d}{du}\left(\frac{dP}{du}\right) \cdot \frac{du}{dT} = \frac{1}{\Delta T_i} \cdot \mathbf{P}’’(u) \cdot \frac{1}{\Delta T_i} = \frac{1}{(\Delta T_i)^2} \mathbf{P}’’(u)
$$

真实加速度的缩放因子是时间间隔平方的倒数。

3. 如何修正上述的Splines？

现在我们把这个理论应用到样条上。

A. 贝塞尔曲线的 $C^1$ 连续性

为了实现 $C^1$ 连续，需要满足 b₄ = 2b₃ - b₂，这意味着控制点 b₃ 是 b₂ 和 b₄ 的中点（所谓的 “Mirror”）。这是因为我们假设了两段曲线的 ΔT 相同，所以我们让 P'(u) 在连接点处相等。

在非均匀时间下，我们需要让真实速度 P'(T) 相等。

第一段曲线（控制点 $b_0, b_1, b_2, b_3$，时长 $\Delta T_0$）在连接点 $b_3$ 的速度是：
$\mathbf{P}’(T_1) = \frac{1}{\Delta T_0} \mathbf{P}’(u=1) = \frac{3}{\Delta T_0} (b_3 - b_2)$
第二段曲线（控制点 $b_3, b_4, b_5, b_6$，时长 $\Delta T_1$）在连接点 $b_3$ 的速度是：
$\mathbf{P}’(T_1) = \frac{1}{\Delta T_1} \mathbf{P}’(u=0) = \frac{3}{\Delta T_1} (b_4 - b_3)$

让它们相等：

$$
\frac{3}{\Delta T_0} (b_3 - b_2) = \frac{3}{\Delta T_1} (b_4 - b_3)
$$

整理后得到：

$$
\frac{b_3 - b_2}{\Delta T_0} = \frac{b_4 - b_3}{\Delta T_1}
$$

结论：控制点 b₂, b₃, b₄ 仍然需要共线，但 b₃ 不再是中点。b₃ 到 b₂ 和 b₄ 的距离之比等于两段曲线的时间长度之比：

$$
\frac{\vert \vert b_3 - b_2 \vert \vert}{\vert \vert b_4 - b_3 \vert \vert} = \frac{\Delta T_0}{\Delta T_1}
$$

“Mirror” 关系变成了一个按时间比例缩放的“加权”关系。

均匀情况下的 Catmull-Rom

Catmull-Rom 的核心是它能自动计算每个点 Pᵢ 上的切线向量 vᵢ，计算方法是：
$\mathbf{v}i = \frac{1}{2} (\mathbf{P}{i+1} - \mathbf{P}_{i-1})$
这个公式的本质是一个几何上的平均。它假设从 Pᵢ₋₁ 到 Pᵢ 和从 Pᵢ 到 Pᵢ₊₁ 所花费的“时间”或“代价”是相同的，所以它给予了前后两个点完全相等的权重。

问题：为什么在非均匀时间下会失效？

想象一下这三个点在时间轴上的位置：

Pᵢ₋₁ 在 T = 0
Pᵢ 在 T = 1
Pᵢ₊₁ 在 T = 10

从 Pᵢ₋₁ 到 Pᵢ 只花了1秒，而从 Pᵢ 到 Pᵢ₊₁ 花了9秒。显然，Pᵢ 点的瞬时速度应该更接近于前一段的“高速”状态，而不是被后一段“慢速”状态平均掉。均匀公式 vᵢ = 0.5 * (Pᵢ₊₁ - Pᵢ₋₁) 完全忽略了时间信息，会导致计算出的切线不符合物理直觉。

解决方案：基于“弦速度”的加权平均

为了解决这个问题，我们需要将切线的计算从几何向量升级为物理速度。

计算弦速度 (Chord Velocity)
我们把连接相邻两个点的直线段称为“弦”。我们可以计算出每根弦的平均速度。
- Pᵢ 前一根弦的平均速度: $\mathbf{S}{i-1} = \frac{\mathbf{P}i - \mathbf{P}{i-1}}{T_i - T{i-1}} = \frac{\mathbf{P}i - \mathbf{P}{i-1}}{\Delta T_{i-1}}$
- Pᵢ 后一根弦的平均速度: $\mathbf{S}{i} = \frac{\mathbf{P}{i+1} - \mathbf{P}i}{T{i+1} - T_i} = \frac{\mathbf{P}_{i+1} - \mathbf{P}i}{\Delta T{i}}$
平均弦速度得到切线
最直接的非均匀 Catmull-Rom 实现，就是将这两个物理意义上的弦速度进行平均，以此作为 Pᵢ 点的切线：
$\mathbf{v}i = \frac{1}{2} (\mathbf{S}{i-1} + \mathbf{S}{i}) = \frac{1}{2} \left( \frac{\mathbf{P}i - \mathbf{P}{i-1}}{\Delta T{i-1}} + \frac{\mathbf{P}_{i+1} - \mathbf{P}i}{\Delta T{i}} \right)$
这个公式现在包含了时间间隔 ΔT，使得时间短（速度快）的线段对切线方向和大小的贡献更大，这远比均匀版本要合理。

均匀情况下的 B-Spline

均匀 B-Spline 的节点向量是等距的，例如 [0, 1, 2, 3, 4, 5, ...]。这导致：

所有基函数（Basis Functions）B(t) 的形状都是完全一样的，只是在时间轴上进行了平移。
曲线的行为在任何地方都是一致的，控制点的移动对曲线形状的影响也是可预测且均匀的。

2. 解决方案：非均匀节点向量 (Non-Uniform Knot Vector)

非均匀 B-Spline 允许节点向量中的值不均匀分布，例如 [0, 0, 0, 1.5, 3.2, 5.0, 5.0, 5.0]。这个看似简单的改变带来了巨大的灵活性。

改变基函数的形状
B-Spline 的基函数是通过 Cox-de Boor 递归公式定义的。我们来看看这个公式（以k次样条为例）：
$B_{i,k}(t) = \frac{t - t_i}{t_{i+k} - t_i} B_{i,k-1}(t) + \frac{t_{i+k+1} - t}{t_{i+k+1} - t_{i+1}} B_{i+1,k-1}(t)$
请注意分母中的 tᵢ₊ₖ - tᵢ。这些项是节点向量中不同节点值之间的差。
- 在均匀情况下，这些差值总是相同的。
- 在非均匀情况下，这些差值是变化的！这意味着基函数的形状本身会根据节点间距的不同而改变。如果节点靠得很近，对应的基函数会变得又高又窄；如果节点分得很开，基函数会变得又矮又宽。
控制曲线的局部行为
由于基函数的形状可以改变，我们可以通过操纵节点向量来精确控制曲线的行为：
- 改变速度：将一段区域内的节点排得更紧密，会使曲线在这段“时间”内更快地变化，因为它要更快地经过受这些节点影响的控制点。
- 改变连续性：这是非均匀B-Spline最强大的特性之一。在一个节点上重复多次（增加节点的“重数”，Multiplicity），可以降低该点的连续性。对于三次B-Spline（通常是 $C^2$ 连续）：
  - 无重复节点 (重数为1): $C^2$ 连续 (曲率连续)。
  - 重复1次 (重数为2): 降为 $C^1$ 连续 (切线连续，但曲率不连续，可形成平滑拐角)。
  - 重复2次 (重数为3): 降为 $C^0$ 连续 (位置连续，但切线不连续，可形成尖角)。
  - 重复3次 (重数为4，即k+1次): 曲线会穿过对应的控制点。

总结：B-Spline 通过其核心定义——非均匀节点向量——来原生处理非均匀参数化。

总结

概念	均匀时间 (Uniform)	非均匀时间 (Non-Uniform)
核心	局部参数 `u` 的导数连续	全局时间 `T` 的导数连续 (使用链式法则)
$C^1$ 贝塞尔	控制点 `b₃` 是 `b₂, b₄` 的中点	`b₂, b₃, b₄` 共线，但距离比等于时间比
Catmull-Rom	切线 `vᵢ = 0.5 * (Pᵢ₊₁ - Pᵢ₋₁)`	需要使用更复杂的加权公式计算切线
B-Spline	使用等距的节点向量	使用非等距的节点向量，自然地处理非均匀性

Posted 2025-05-13Updated 2025-09-03Physics12 minutes read (About 1831 words)

Simulation Basics

Spring - Damper

$$
f = -k_s \boldsymbol{x} - k_d \boldsymbol{v}
$$

where,
$k_s = \text{spring constant}$

$\boldsymbol{x} = \text{distance from rest state}$
$x = \vert \vert \boldsymbol{x}_1 -\boldsymbol{x}_2 \vert \vert - l$
$\hat{\boldsymbol{d}} = \frac{\boldsymbol{x}_1 - \boldsymbol{x}_2}{\vert \vert\boldsymbol{x}_1 - \boldsymbol{x}_2 \vert \vert}$
$\boldsymbol{x} = x\hat{\boldsymbol{d}}$

$k_d = \text{damping factor}$
$\boldsymbol{v} = \text{velocity along spring axis}$

Aerodynamic Forces

$f_{drag} = \frac{1}{2} \rho \vert \vert \boldsymbol{v}\vert \vert ^2 c_d A \hat{\boldsymbol{v}}$

$f_{lift} = \frac{1}{2} \rho \vert \vert \boldsymbol{v}\vert \vert ^2 c_L S(\alpha - \alpha_0)$

Simulation

General Steps

Compute all forces acting on each particle in the
system in the current configuration
Compute the resulting acceleration for each particle
(a=f/m)
Integrate over some small time step Dt to get new
positions and velocities at next time step
Repeat

一般的，我们有这样的关系：

$$
\ddot{\boldsymbol{x}} = \frac{1}{m} \boldsymbol{f(x, \dot{x})}
$$

或是，

$$
\dot{\boldsymbol{v}} = \frac{1}{m} \boldsymbol{f(x, v)}
$$

可以理解为，当前的加速度由当前的位移与速度决定。例如，在一个弹簧阻尼系统中，当前受到的力是位移和速度共同决定的。

我们往往想要求一阶的表示方式$\boldsymbol{x}(t)$

Phase Space

State Position (6 x 1)
$s = \left[\begin{matrix}
\boldsymbol{x} \ \boldsymbol{v}
\end{matrix}\right]$

State Velocity (6 x 1)
$s = \left[\begin{matrix}
\dot{\boldsymbol{x}} \ \dot{\boldsymbol{v}}
\end{matrix}\right]$

Particle System Dynamics (6 x 1)

$s = \left[\begin{matrix}
\boldsymbol{v} \ \frac{\boldsymbol{f}}{m}
\end{matrix}\right]$

清空旧力 → 计算新力 → 准备数据给求解器 → 用求解器更新粒子状态

Angular Momentum: $\boldsymbol{L} = \boldsymbol{r} \times \boldsymbol{p} = \boldsymbol{r} \times m \boldsymbol{v}$

角动量可以理解为物体旋转的某种“惯性”，它是一种矢量物理量，具有大小和方向

Rate of change of Angular Momentum (力矩/扭矩)

$\boldsymbol{\tau} = \frac{d\boldsymbol{L}}{dt} = \frac{d\boldsymbol{r}}{dt} \times m\boldsymbol{v} + \boldsymbol{r} \times m \frac{d\boldsymbol{v}}{dt}$

$\boldsymbol{\tau} = \boldsymbol{r} \times \boldsymbol{f}
$
力与力臂的乘积

velocity of the particle:
$\boldsymbol{v} = \frac{d\boldsymbol{r}}{dt} = \omega \times r$

当一个刚体绕固定轴旋转时，其上的点做的是圆周运动。圆周运动的速度方向是切线方向，可以通过角速度的叉乘表达出来。

$$
\mathbf{v} = \frac{d\mathbf{r}}{dt} = \dot{\mathbf{r}} = \boldsymbol{\omega} \times \mathbf{r}
$$

Acceleration:

$$
\mathbf{a} = \frac{d\mathbf{v}}{dt} = \frac{d\boldsymbol{\omega}}{dt} \times \mathbf{r} + \boldsymbol{\omega} \times \frac{d\mathbf{r}}{dt}
$$

$$
\quad = \dot{\boldsymbol{\omega}} \times \mathbf{r} + \boldsymbol{\omega} \times (\boldsymbol{\omega} \times \mathbf{r})
$$

当角加速度为零时：
$\dot{\boldsymbol{\omega}} = 0 $

Centripetal acceleration

$$
\mathbf{a} = \boldsymbol{\omega} \times (\boldsymbol{\omega} \times \mathbf{r})
$$

Centrifugal force

$$
\mathbf{f}_{\text{centrifugal}} = -m \boldsymbol{\omega} \times (\boldsymbol{\omega} \times \mathbf{r})
$$

If the length of $\mathbf{r}$ is changing as a function of time then,
Velocity:

$$
\mathbf{v} = \frac{d\mathbf{r}}{dt} = \boldsymbol{\omega} \times \mathbf{r} + \mathbf{v}_r
$$

Acceleration:

$$
\frac{d\mathbf{v}}{dt} = \frac{d\boldsymbol{\omega}}{dt} \times \mathbf{r}

\boldsymbol{\omega} \times \frac{d\mathbf{r}}{dt}
\frac{d\mathbf{v}_r}{dt}
$$

$$
= \dot{\boldsymbol{\omega}} \times \mathbf{r}

\boldsymbol{\omega} \times \dot{\mathbf{r}}
\dot{\mathbf{v}}_r
$$

因为：

$$
\dot{\mathbf{r}} = \boldsymbol{\omega} \times \mathbf{r} + \mathbf{v}_r
$$

所以：

$$
\mathbf{a} = \dot{\boldsymbol{\omega}} \times \mathbf{r} + \boldsymbol{\omega} \times (\boldsymbol{\omega} \times \mathbf{r}) + \dot{\mathbf{v}}_r
$$

Rewriting Angular Momentum

$$
\mathbf{L} = \mathbf{r} \times \mathbf{p} = \mathbf{r} \times (m\mathbf{v}) = m\mathbf{r} \times \mathbf{v} = m\mathbf{r} \times (\boldsymbol{\omega} \times \mathbf{r}) = -m \mathbf{r} \times \mathbf{r} \times \boldsymbol{\omega}
$$

这里的 𝑟 表示的是：

质点（或物体上某一点）相对于旋转轴/参考点的位置向量。

🔹 Replace $\mathbf{r} \times$ with a cross product equivalent matrix $\boldsymbol{\Gamma}$ , where:

$$
\boldsymbol{\Gamma} =
\begin{bmatrix}
0 & -r_z & r_y \
r_z & 0 & -r_x \
-r_y & r_x & 0
\end{bmatrix}
$$

🔹 This allows:

$$
\mathbf{L} = -m \mathbf{r} \times (\mathbf{r} \times \boldsymbol{\omega}) = -m \boldsymbol{\Gamma} \cdot \boldsymbol{\Gamma} \cdot \boldsymbol{\omega}
$$

🔹 Define Moment of Inertia Matrix:

$$
\mathbf{I} = -m \boldsymbol{\Gamma} \cdot \boldsymbol{\Gamma}
$$

🔹 Results in the Angular Momentum equation:

$$
\mathbf{L} = \mathbf{I} \cdot \boldsymbol{\omega}
$$

Integration

当我们知道一个关系后，例如first-order ODE form：

$\dot{\boldsymbol{x}} = \boldsymbol{f(x,u)}$ ，我们想知道 $\boldsymbol{x}(t)$

根据泰勒展开公式，有

$$
\mathbf{x}(t_0 + \Delta t) =
\mathbf{x}(t_0) +
\frac{d\mathbf{x}}{dt}(t_0) \Delta t +
\frac{1}{2!} \frac{d^2 \mathbf{x}}{dt^2}(t_0) \Delta t^2 +
\frac{1}{3!} \frac{d^3 \mathbf{x}}{dt^3}(t_0) \Delta t^3 + \cdots
$$

换一种写法：用离散帧时间 $t_k$

可以把时间离散成一帧一帧（每一帧时长为 $\Delta t$）：

$$
\mathbf{x}(t_{k+1}) \approx \mathbf{x}(t_k) +
\dot{\mathbf{x}}(t_k)\Delta t +
\frac{1}{2!} \ddot{\mathbf{x}}(t_k)\Delta t^2 +
\frac{1}{3!} \dot{\ddot{\mathbf{x}}}(t_k)\Delta t^3 + \cdots
$$

高次的项可以由链式法则求得

仅保留一阶的项时，有

$$
\mathbf{x}(t_{k+1}) \approx \mathbf{x}(t_k) +
\dot{\mathbf{x}}(t_k)\Delta t
$$

此处，

$$
\dot{\mathbf{x}}(t_k) = \mathbf{f(x(t_k), u(t_k) )}
$$

这样做的话如果时间间隔过长，会很不稳定。

我们考虑使用 $2^{nd} \text{Order Runge Kutta}$
Step 1.
使用欧拉方法计算预测值

$$
\mathbf{x}^p(t_{k+1}) = \mathbf{x}(t_k) + \dot{\mathbf{x}}(t_k)\Delta t
$$

而后，讲此预测值带入到关系中，有

$$
\dot{\mathbf{x}}^p(t_{k+1}) = f(\mathbf{x}^p(t_{k+1}), \mathbf{u}(t_k))
$$

最后求取一个平均值

$$
\mathbf{x}(t_{k+1}) = \mathbf{x}(t_k) + \frac{\Delta t}{2} \left[
\dot{\mathbf{x}}(t_k) + \dot{\mathbf{x}}^p(t_{k+1})
\right]
$$

4ᵗʰ Order Runge Kutta

$$
\mathbf{x}(t_{k+1}) = \mathbf{x}(t_k) + \frac{1}{6} \left[ \mathbf{d}_1 + 2\mathbf{d}_2 + 2\mathbf{d}_3 + \mathbf{d}_4 \right]
$$

🔹 Where:

$\mathbf{d}_1 = \Delta t \cdot \dot{\mathbf{x}}(t_k) = \Delta t \cdot \mathbf{f}(\mathbf{x}(t_k), \mathbf{u}(t_k))$

$\mathbf{d}_2 = \Delta t \cdot \mathbf{f} \left(t_k + \frac{\Delta t}{2}, \mathbf{x}(t_k) + \frac{1}{2}\mathbf{d}_1 \right)$

$\mathbf{d}_3 = \Delta t \cdot \mathbf{f} \left(t_k + \frac{\Delta t}{2}, \mathbf{x}(t_k) + \frac{1}{2}\mathbf{d}_2 \right)$

$\mathbf{d}_4 = \Delta t \cdot \mathbf{f} \left(t_k + \Delta t, \mathbf{x}(t_k) + \mathbf{d}_3 \right)$

🔁 Backwards Euler:

$\mathbf{x}(t_k) = \mathbf{x}(t_{k+1}) - \dot{\mathbf{x}}(t_{k+1}) \Delta t$

or equivalently:

$\mathbf{x}(t_{k+1}) = \mathbf{x}(t_k) + \dot{\mathbf{x}}(t_{k+1}) \Delta t$

问题是，我们怎么算 $\dot{\mathbf{x}}(t_{k+1})$

①根据我们已知的关系，
我们${\boldsymbol{\dot {x}}(t_{k+1})} = \boldsymbol{f}(\boldsymbol{x}(t_{k+1}))$
但是我们不知道$\boldsymbol{x}(t_{k+1})$，我们使用RK2来估算这个值

②一般的，我们使用泰勒展开

$$
\boldsymbol{x}(t_{k+1}) = \boldsymbol x(t_k) + \Delta t \cdot \dot{ \boldsymbol{x}}(t_{k+1}) \ = \boldsymbol{x}(t_{k}) + \Delta \boldsymbol{x}(t_k)
$$

$$
\dot{\mathbf{x}}(t_{k+1}) = \boldsymbol{f}(\boldsymbol{x}(t_{k+1})) = \boldsymbol{f}(\boldsymbol{x}(t_{k}) + \Delta \boldsymbol{x}(t_k))
$$

$$
\boldsymbol{x}(t_{k}) + \Delta \boldsymbol{x}(t_k) = \boldsymbol{x}(t_k) + \Delta t \cdot \boldsymbol{f}(\boldsymbol{x}(t_{k}) + \Delta \boldsymbol{x}(t_k))
$$

$$
\mathbf{x}(t_{k+1}) = \mathbf{x}(t_k) + \Delta \mathbf{x}(t_k)
$$

$$
\mathbf{x}(t_{k+1}) = \mathbf{x}(t_k) +
\left( \frac{1}{\Delta t} \mathbf{I} - \frac{\partial \mathbf{f}}{\partial \mathbf{x}} (t_k) \right)^{-1}
\mathbf{f}(\mathbf{x}(t_k))
$$

设：

$$
\mathbf{f}(x, y) =
\begin{bmatrix}
f_1(x, y) = x^2 + y \
f_2(x, y) = \sin(xy)
\end{bmatrix}
$$

则 Jacobian 是：

$$
\frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \
\frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y}
\end{bmatrix}

\begin{bmatrix}
2x & 1 \
y\cos(xy) & x\cos(xy)
\end{bmatrix}
$$

Adams Bashforth

🔹 Keep terms up to 2ⁿᵈ order in Taylor series approx

$$
\mathbf{x}(t_{k+1}) = \mathbf{x}(t_k) + \dot{\mathbf{x}}(t_k),\Delta t

\frac{1}{2!},\ddot{\mathbf{x}}(t_k),\Delta t^2
$$

🔹 Approximate the second derivative as

$$
\ddot{\mathbf{x}}(t_k)
\approx
\frac{\dot{\mathbf{x}}(t_k) - \dot{\mathbf{x}}(t_{k-1})}{\Delta t}
$$

🔹 Which yields

$$
\mathbf{x}(t_{k+1})
= \mathbf{x}(t_k)

\frac{\Delta t}{2},\bigl[3,\dot{\mathbf{x}}(t_k) - \dot{\mathbf{x}}(t_{k-1})\bigr]
$$

Verlet Integration

🔹 Does not require velocity to be explicitly computed

🔹 Given the current value of $\mathbf{x}(t_k)$, compute $\mathbf{x}$ at the next time step $t_{k+1}$ and previous time step $t_{k-1}$using a 2ⁿᵈ order approximation:

$$
\mathbf{x}(t_{k+1}) = \mathbf{x}(t_k) + \dot{\mathbf{x}}(t_k) \Delta t + \frac{1}{2} \ddot{\mathbf{x}}(t_k) \Delta t^2
$$

$$
\mathbf{x}(t_{k-1}) = \mathbf{x}(t_k) - \dot{\mathbf{x}}(t_k) \Delta t + \frac{1}{2} \ddot{\mathbf{x}}(t_k) \Delta t^2
$$

🔹 Adding the two together and rearranging yields:

$$
\mathbf{x}(t_{k+1}) = 2\mathbf{x}(t_k) - \mathbf{x}(t_{k-1}) + \ddot{\mathbf{x}}(t_k) \Delta t^2
$$

Posted 2024-03-26Updated 2025-06-07Graphics / Math17 minutes read (About 2567 words)

Complex number & Quaternion

Refer to: quaternion.pdf quaternion by Krasjet

坐标系

本文将使用右手坐标系统，用右手定则来定义旋转的正方向，即右手拇指指向旋转轴的 $\bf{u}$ 的正方向，这时其它四个手指弯曲的方向即为旋转的正方向。

基础

点乘，标量积

可以看到，$\boldsymbol{v}_{||}$其实就是 $\boldsymbol{v}$ 在 $\boldsymbol{u}$ 上的正交投影(OrthogonalProjection)，根据正交投影的公式，我们可以得出：

我们需要处理正交于 $\boldsymbol{u}$ 的 $\bf v_{\perp}$

方向永远垂直于矢量ab所在的平面啊。具体方向由右手定则决定。四指跟从向量A和B的旋转方向时，大拇指所朝的方向。

对于四元数 $q = a + bi +cj + dk$， $i^{2} = j^{2} = k^{2} = ijk = -1$

逆和共轭

$q^{-1}$是$q$的逆，规定：

$$
qq^{-1} = q^{-1} q = 1 (q \neq 0)
$$

有

$$
(pq)q^{-1} = p(qq^{-1}) = p \cdot 1 = p \
q^{-1}(qp) = (q^{-1}q)p = 1 \cdot p = p
$$

如果 $|| q || = 1$，也就是说 $q$ 是一个单位四元数，那么

$$
q^{-1} = \frac{q ^*}{1^{2}} = q^* = [s, \bf{-v}]
$$

旋转

换言之，当我们有 $q= [\cos(\theta), \sin(\theta)\bf{u}]$时，那么$v’ = qvq^*$ 可以将 $\bf{v}$ 沿着 $\bf{u}$ 旋转 2$\theta$ 度。

我们把 $\bf{v}$ 写作一个纯四元数，然后按照四元数乘法进行运算，注意，四元数乘法有结合律，但四元数乘法通常没有交换律。

左乘一个四元数 $q = a + bi +cj + dk$ 等同于左乘下面这个矩阵：

$$
L(q) = \begin{bmatrix}
a & -b & -c & -d \\
b & a & -d & c \\
c & d & a & -b \\
d & -c & b & a
\end{bmatrix}
$$

右乘 $q$ 等同于这个左乘这个矩阵：

$$
R(q) = \begin{bmatrix}
a & -b & -c & -d \\
b & a & d & -c \\
c & -d & a & b \\
d & c & -b & a
\end{bmatrix}
$$

我们有：$qvq^* = L(q)R(q^*)v = R(q^*)L(q)v$, 注意$R(q^*) = R(q)^{T}$

也就是：

$$
qvq^* = \begin{bmatrix}
1 & 0 & 0 & 0 \
0 & 1 - 2c^2 - 2d^2{} & 2bc - 2ad & 2ac + 2bd \
0 & 2bc + 2ad & 1 - 2b^2 - 2d^2 & 2cd - 2ab \
0 & 2bd - 2ac & 2ab + 2cd & 1 - 2b^2 - 2c^2
\end{bmatrix} v
$$

四元数，轴-角，旋转矩阵，欧拉角转换

任意向量 $\bf{v}$ 沿着以单位向量定义的旋转轴 $\bf{u}$ 旋转 $\theta$ 角度之后的 $\bf{v’}$ 可以使用矩阵乘法来获得。

令 $v = [0, \boldsymbol{v} ], q = [\cos(\frac{\theta}{2}), \sin(\frac{\theta}{2})\boldsymbol{u}]$

四元数与3D旋转的关系并不是一对一的，同一个3D旋转可以使用两个不同的四元数来表示。
对于任意的$q = [\cos(\frac{\theta}{2}), \sin(\frac{\theta}{2})\boldsymbol{u}]$，$q$与$-q$ 代表的是同一个旋转。如果𝑞表示的是沿着旋转轴$\bf{u}$旋转θ度，那么−𝑞代表的是沿着相反的旋转轴$- \bf{u}$旋转(2𝜋−θ)度。

对于常见的 ZYX 顺序（Unity Inspector 中使用的顺序，对应先绕Y轴，再绕X轴，再绕Z轴应用旋转，或者说物体先绕自身的Z轴，再绕自身新的X轴，再绕自身新的Y轴）： $q_X=[cos(α/2),sin(α/2)(1,0,0)]$ $q_Y=[cos(β/2),sin(β/2)(0,1,0)]$ $q_Z=[cos(γ/2),sin(γ/2)(0,0,1)]$ 最终的四元数 $q=q_Y⋅q_X⋅q_Z$ (注意乘法顺序，取决于欧拉角的具体定义是内旋还是外旋，以及轴的顺序)。

这个转换是明确的：给定一组特定顺序的欧拉角，总能计算出唯一的一个四元数（不考虑 q 和 −q 的等价性）。

插值

假设有两个旋转变换 $q_0 = [\cos(\frac{\theta_0}{2}), \sin(\frac{\theta_0}{2})\boldsymbol{u_0}]$, $q_1 = [\cos(\frac{\theta_1}{2}), \sin(\frac{\theta_1}{2})\boldsymbol{u_1}]$, 我们希望找出一些中间变换$q_t$, ，让初始变换$q_0$能够平滑地过渡到最终变换$q_1$．𝑡 的取值可以是𝑡 ∈[0,1]

旋转的变化量其实对应的仍是一个旋转。其将由 $q_0$ 变换到 $v_0$ 的向量进一步旋转到 $v_t$ 这个旋转拥有某一个固定的旋转轴 $\bf u_t$，，我们只需要缩放这个变换所对应的角度𝜙就能够达到插值的目的了。

我们该怎么获得这个旋转的变化量呢?

考虑一下什么变换Δ𝑞能将已经旋转到$v_0$的向量$v$直接变换到$v_1$，这其实是一个旋转的复合，我们先进行$q_0$变换，再进行Δ𝑞变换，它们复合的结果需要等于$q_1$变换，即

$$
\Delta q q_0 = q_1 \\
\Delta q q_0 q_0 ^{-1} = q_1 q_0 ^{-1}
$$

而所有的旋转$q$都是单位四元数，有$q^{-1} = q^*$，上式又可以写作

$$
\Delta q = q_1 q_0^*
$$

Lerp

$$
q_t = Lerp(q_0, q_1, t) = (1-t)q_0 + tq_1
$$

我们是沿着一条直线（也就是圆上的一个弦）进行插值的，这样插值出来的四元数并不是单位四元数

Nlerp

虽然这样插值出来的 $q_t$ 并不是单位四元数，但只要将$q_t$除以它的模长 $||q_t||$ 就能够将其转化为一个单位四元数了

$$
\boldsymbol{v}_t = Nlerp(\boldsymbol{v}_0, \boldsymbol{v}_1, t) = \frac{(1-t)\boldsymbol{v}_0 + t \boldsymbol{v}_1}{|| (1-t)\boldsymbol{v}_0 + t \boldsymbol{v}_1||} \
\boldsymbol{q}_t = Nlerp(\boldsymbol{q}_0, \boldsymbol{q}_1, t) = \frac{(1-t)\boldsymbol{q}_0 + t \boldsymbol{q}_1}{|| (1-t)\boldsymbol{q}_0 + t \boldsymbol{q}_1||}
$$

Nlerp 插值仍然存在有一定的问题，当需要插值的弧比较大时，$\boldsymbol{v}_t$的角速度会有显著的变化．我们可以来看一个例子：

这五个𝑡值将整个弧和弦分割成了四个部分．虽然弦上的四段是等长的，但是四个弧是完全不相等的．𝑡=0到𝑡=0.25之间的弧（红色）明显比𝑡=0.25 到𝑡 =0.50的弧（蓝色）要短了不少

Slerp

为了解决这个问题，我们可以转而对角度进行线性插值．也就是说，如果 $\boldsymbol{v}_1$ 和$\boldsymbol{v}_2$之间的夹角为θ，那么

$$
\theta_t= (1-t) \cdot 0 + t \theta = t \theta
$$

$$
q_t = Slerp(q_0, q_1, t) = \frac{\sin((1-t)\theta)}{\sin(\theta)} q_0 + \frac{\sin(t\theta)}{\sin(\theta)} q_1
$$

其中，

$$
\theta =\cos^{-1}(q_0 \cdot q_1)
$$

注意，这里的 $q_0 \cdot q_1$ 是点积，其结果是一个标量，计算方法为：

$$
q_0 \cdot q_1 = w_0w_1 + x_0x_1 + y_0 y_1 + z_0z_1
$$

而因为任意两个向量（包括四位四元数向量），它们的点积与它们之间的夹角 $\theta$ 以及它们各自的模长（长度）有关：

$$
q_0 \cdot q_1 = ||q_0|| \cdot ||q_1|| \cos(\theta)
$$

如果我们需要对多个四元数进行插值，对每一对四元数使用Slerp 插值虽然能够保证每两个四元数之间的角速度是固定的，但是角速度会在切换插值的四元数时出现断点，或者说在切换点不可导。

Squad

$$
Squad(q_0, q_1, q_2, q_3;t) = Slerp(Slerp(q_0, q_3;t),Slerp(q_1, q_2; t); 2t(1-t))
$$

常见 API

1	public static Quaternion LookRotation (Vector3 forward, Vector3 upwards= Vector3.up);

使用指定的 forward 和 upwards 方向创建旋转。本地的 Z 轴将与给定的，世界中的forward向量对齐，本地的X 轴与 forward 和 upwards 之间的差积对齐，Y 轴与 Z 和 X 之间的差积对齐。

public class Test: MonoBehaviour{
    public Transform sphere;

    void Update(){
        Vector3 dir = sphere.position - transform.position;  //正方体指向球体的向量dir = 球体坐标 - 正方体坐标
        Quaternion ang = Quaternion.LookRotation(dir);  //创建一个 使正方体朝向球体的旋转角
        transform.rotation = ang;  //使正方体朝向球体

        Debug.DrawRay(transform.position, transform.forward * 100, Color.blue);  //绘制正方体forward
        Debug.DrawRay(transform.position, dir, Color.green);  //绘制向量dir
        Debug.DrawRay(transform.position, ang.eulerAngles, Color.red);
    }
}

场景运行Gif

我们需要构建这样一个坐标系，换言之，其被“旋转为”：

Z轴（前方向，z_axis），也就是输入的forward
X轴（右方向， x_axis）,此轴追至于刚刚算出来的Z轴，并垂直于upwards，我们通过叉乘得到
1
Vector3 x_axis = Vector3.Normalize(Vector3.Cross(upwards, z_axis));
Y轴（上方向，y_axis），此轴垂直于刚刚运算得到的Z轴与X轴，
1
y_axis = Vector3.Cross(z_axis, x_axis);

我们期望的旋转矩阵$M$的作用：

我们希望这个矩阵 $M$ 能够将物体的局部坐标系中的向量转换到世界坐标系中。换句话说，当我们用$M$左乘一个局部坐标向量时，我们应该得到这个向量在世界坐标系中的表示。

也就是说，

$M \cdot \mathbf{e}{\mathbf{x}{\text{local}}} = M\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$ 此结果为$M$的第一列
$M \cdot \mathbf{e}{\mathbf{y}{\text{local}}} = M\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$此结果为$M$的第二列
$M \cdot \mathbf{e}{\mathbf{z}{\text{local}}} = M\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$此结果为$M$的第三列

一个从A坐标系到B坐标系的变换矩阵，其列向量就是A坐标系的基向量在B坐标系中的表示。即使不存在一组欧拉角能够精确构成这个特定的变换矩阵 $M$

$M = \begin{bmatrix} x_{axis}.x & y_{axis}.x & z_{axis}.x \\ x_{axis}.y & y_{axis}.y & z_{axis}.y \\ x_{axis}.z & y_{axis}.z & z_{axis}.z \end{bmatrix}$

假设A是标准的右手坐标系。考虑一个简单的反射变换 $M = \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ (沿YZ平面反射，即X轴反向)。 A的X轴 $(1,0,0)_A$ 变换后是 $(-1,0,0)_B$ —— 这是 $M$ 的第一列。 A的Y轴 $(0,1,0)_A$ 变换后是 $(0,1,0)_B$ —— 这是 $M$ 的第二列。 A的Z轴 $(0,0,1)_A$ 变换后是 $(0,0,1)_B$ —— 这是 $M$ 的第三列。由 ${(-1,0,0)_B, (0,1,0)_B, (0,0,1)_B}$ 构成的坐标系B现在是一个左手坐标系。

Posted 2024-03-19Updated 2024-03-25Game Design & Dev / Game Development / Unity3D / WebGL12 minutes read (About 1871 words)

尝试打包WebGL平台遇到的坑

在将一个 Unity 2022.3 数字孪生项目(利用Unity3D_Robotics_ABB)从原 Win 平台迁移到 WebGL 平台的过程中，遇到了 RuntimeError: null function or function signature mismatch、 The script 'xxxxx' could not be instantiated!、digest auth failed 等等报错和故障。在对网上的解决方案探索和尝试的过程中，有小结如下

System.Threading 相关命名空间（C# 原生线程）不支持

这是着手切换WebGL平台之前就知道不行的一点，不过使用 UniTask 可以替换。

原 Unity3D_Robotics_ABB 作者在其实现中主要使用了原生的 Thread 来处理异步的逻辑，在 Unity 打 WebGL 包时并没有报错且能成功出包，然而，在运行 WebGL 的程序时（尤其做某些会拉起线程的操作时），会出现卡死、弹窗报错RuntimeError: null function or function signature mismatch 的问题。

若出现上述问题，应该考虑利用 Unity 托管的异步实现（UniTask、协程等），或 async/await 关键词处理对应需求。

System.Net 相关命名空间不支持（需要利用Unity提供的 UnityWebRequest(UWR) 来进行 HTTP 请求）

这是解决问题花费时间比较长的一点。

在解决完线程的问题后，运行 WebGL 程序，发现会出现 The script 'xxxxx' could not be instantiated! 等控制台警告，且此脚本功能失效。经过测试后，发现脚本中若引用了 System.Net 等 .NET networking classes，则此脚本不能被正确加载，见 WebGL Networking ，此时需要使用 UnityWebRequest 类进行 http 请求。

由于 UnityWebRequest 没有对 Authentication 的支持，需要手动处理一些问题：digest auth failed 。例如，Basic 验证方案可以自己加Header 并存入对应值。然而，对于 Digest 验证方案，手动处理比较麻烦。此时考虑调 JavaScript 的加 Digest 验证的方法并发送 HTTP 请求，或者由代理服务器添加 Digest 验证

技术水平有限，最终选择了由代理服务器添加 Digest 验证的方法，下面是对应的 Python 程序：

from flask import Flask, request, Response
import requests
from requests.auth import HTTPDigestAuth

app = Flask(__name__)

# The URL of the API that requires Digest Authentication
TARGET_API_URL = "http://www.tsingloo.com"
# Replace 'your_username' and 'your_password' with your actual credentials
USERNAME = 'your_username'
PASSWORD = 'your_password'

@app.route('/', defaults={'path': ''})
@app.route('/<path:path>', methods=['GET', 'POST', 'PUT', 'DELETE'])
def proxy(path):
    # Construct the full URL to the target API
    url = f"{TARGET_API_URL}/{path}"
    
   # Log the requested URL
    app.logger.info(f"I am requesting {url}")

    # Get the original request method and data
    method = request.method
    data = request.get_data()
    
    # Selectively copy headers, excluding those that can cause issues
    excluded_headers = ['Host', 'Content-Length', 'Content-Type']
    headers = {key: value for key, value in request.headers if key not in excluded_headers}

    # If the request has a content type, include it in the headers
    if 'Content-Type' in request.headers:
        headers['Content-Type'] = request.headers['Content-Type']

    # Inside your proxy function
    response = requests.request(method, url, data=data, headers=headers)

    # Make sure to copy the content-type header from the original response to the new response
    content_type = response.headers.get('Content-Type')
    response_headers = {'Content-Type': content_type} if content_type else {}

    return Response(response.content, status=response.status_code, headers=response_headers)

@app.before_request
def log_request():
    print(f"Received request: {request.url} - {request.method}")

if __name__ == '__main__':
    app.run(debug=True)

短时间内利用 UWR 对 RWS 发送大量请求

为了追求项目的实时性，短时间内 Unity 利用 UWR 向机械臂(RWS)发送大量请求，前70个请求成功，而后的请求报 “503 Service Unavailable (Too many sessions 70/70)”错，且当双方断开连接时，仍需要一段时间“冷却”，而后才能请求正确收到回复，然后短时间内约70个 UWR 的 http 的请求后，重复上述 503 错误。使用C#提供的 HTTP 相关类时未见此问题。当503报错时，关闭代理服务器，重新插拔网线，都不能立刻释放此70个sessions。

值得注意的是，当 UWR 已经收到503报错时，再继续使用Postman测试相关接口，如(http://192.168.x.x/rw/rapid/tasks/T_ROB1/motion?resource=jointtarget)时，Postman有时可以正常收到数据。复盘此情况，推测是在排障中，由 Postman 已经发送了带 http头”Connection: Keep-Alive”并成功建立了正常的连接，早于 UWR 的请求，换言之，其 70 个session中已经维护了一个与postman的长连接。冷却后，使用 Postman 短时间内大量测试相关接口可以持续获得正常的数据，未见503报错。

冷却后，当使用Edge浏览器短时间内大量访问接口，前70个请求成功，而后的也会503报错，表现与 UWR 一样，且观察浏览器发出的 HTTP 请求，其已携带”Connection: Keep-Alive”，不同的是，浏览器发出的 HTTP 带有很长的“User-Agent”头，这一点没弄明白。

冷却后，当使用 Python 对接口短时间内大量发送简单的http请求（不挟带请求头”Connection: Keep-Alive”）时，前70个请求成功，而后的也会503报错，表现与 UWR 一样。

经检查，（编辑器 2022.3 版本默认的） UWR 发出的 HTTP 请求默认不携带标头 Connection: Keep-Alive，有讨论，见Connection: keep-alive in Unity Web Request?，且当使用 UnityWebRequest.SetRequestHeader 手动设置此标头时，会见警告且官方文档不推荐如此做。

最终，通过 UnityWebRequest.SetRequestHeader 手动设置标头”Connection: Keep-Alive”，解决了上述 503 报错。

几乎同一时刻对 RWS 发送多个请求

在将原线程替换为 UniTask 后，某个方法每帧将同时对 RWS 发送三个不同的 HTTP 请求，导致报 “Connection manager can’t add .. ” 错，改为一帧（约23ms）轮流发一个 HTTP 请求后解决。

HttpClient.PostAsync 与 UnityWebRequest.Post

使用PostAsync时，需要传入一个string和一个httpcontent，在使用 .Post 时，可以是两个string，也可以是一个string和一个 WWWForm，推荐使用后者，可以规避一些可能的编码问题。

反序列化时 dynamic 关键字引起的崩溃

项目中使用从 Unity Package Manager 中添加的 Newtonsoft.Json 来处理请求，原本从字符串中反序列化对象代码如下：

1 2	dynamic obj = JsonConvert.DeserializeObject(jsonStr); var state = obj.name;

初次尝试打WebGL包时会报缺少某命名空间错，而后在项目设置中将 API 兼容性级别改为 .Net Framewrok 后不再报错。
上述代码会导致WebGL运行时弹窗报错奔溃，推荐参考 json 的内容定义类型而后获取对应字段的值(JSON2CSharp)。

[Serializable]
public class People
{
    public string name;
    public int age;
    public DateTime birthday;
    public bool isMarried;
    public float money;
}

public void DeserializeJson()
{
    string jsonStr = "{\"name\" :\"张三\",\"age\":30,\"birthday\":\"1990-03-31T17:15:29\",\"isMarried\":true,\"money\":10.0}";

    People zhangsan = JsonConvert.DeserializeObject<People>(jsonStr);
    Debug.Log("zhangsan.name " + zhangsan.name + " birthday: " + zhangsan.birthday + " isMarried: " + zhangsan.isMarried);

    //output: zhangsan.name 张三 birthday: 03/31/1990 17:15:29 isMarried: True
}

Posted 2023-12-30Updated 2025-06-07技能工具 / 项目 / 个人网站搭建6 minutes read (About 883 words)

利用 Mark Magic 将 Joplin 笔记发布为 Hexo 博客

Refer to Mark Magic, Hexo

Introduction

Mark Magic 是一款可以将来自于一些常用笔记软件的文件数据转化为目标软件 markdown 笔记的工具。我们可以利用它来将 Joplin 中的笔记导出，随后使用 Hexo 生成网页并发布

Setup

需要保证 Joplin Web Clipper 功能打开，可在Joplin 软件中 Tools -> Options -> Web Clipper 处打开。

Mark Magic 要求 Node.js® version must >= 20

Mark Magic

请参考官方文档 Mark Magic, 其给出了详尽的说明与范例。此处以 Joplin 与 Hexo 为例，权且做个记录：

为 Mark Magic 新建文件夹

在此目录下打开命令行，输入以下命令：

1 2	Set-ExecutionPolicy Bypass -Scope Process npm i -D @mark-magic/cli @mark-magic/plugin-joplin @mark-magic/plugin-hexo

安装完毕后，我们在此目录中手动添加 mark-magic.config.yaml 文件

# mark-magic.config.yaml
tasks:
  - name: blog
    input:
      name: '@mark-magic/plugin-joplin' # Input plugin to read data from Joplin notes
      config:
        baseUrl: 'http://localhost:27583' # The address of the Joplin web clipper service, usually http://localhost:41184. Here, we are using the development address http://localhost:27583 for demonstration purposes.
        token: '5bcfa493307XXXXXXXXXXXXXXXXXXXXXXX' # The token for the Joplin web clipper service
        tag: blog # Filter the notes based on a tag
    output:
      name: '@mark-magic/plugin-hexo' # Output plugin to generate the files required by Hexo
      config:
        path: './' # The root directory of the Hexo project
        base: /joplin-hexo-demo/ # The baseUrl when deployed. By default, it is deployed at the root path of the domain. It should match the `root` configuration in the hexo _config.yml file.

命令行中输入 npx mark-magic 运行插件
完毕

Hexo

Hexo 是一款静态博客生成器，其可以将 markdown 文件转换为 html 网页。

安装hexo
1
npm install hexo-cli -g
在空文件夹中，初始化hexo
1
hexo init
注意，在win中，这一步可能会遇到 hexo.ps1 cannot be loaded because running scripts is disabled on this system. For more information 的报错，请在系统设置中 Turn on these settings to execute PowerShell scripts

调整 package.json

{
  "name": "hexo-site",
  "version": "0.0.0",
  "private": true,
  "scripts": {
    "build": "hexo generate",
    "clean": "hexo clean",
    "deploy": "hexo deploy",
    "server": "hexo server"
  },
  "hexo": {
    "version": ""
  },
  "dependencies": {
    "hexo": "^7.0.0",
    "hexo-abbrlink": "^2.2.1",
    "hexo-deployer-cos-cdn": "^1.7.1",
    "hexo-generator-archive": "^2.0.0",
    "hexo-generator-category": "^2.0.0",
    "hexo-generator-index": "^3.0.0",
    "hexo-generator-tag": "^2.0.0",
    "hexo-renderer-ejs": "^2.0.0",
    "hexo-renderer-marked": "^6.2.0",
    "hexo-renderer-stylus": "^3.0.0",
    "hexo-server": "^3.0.0",
    "hexo-theme-landscape": "^1.0.0"
  }
}

选择主题，并按照主题要求安装，以icarus为例
1
2
npm install hexo-theme-icarus
hexo config theme icarus
测试启动hexo
1
2
hexo g
hexo s
配置_config.yml与_config.icarus.yml文件

添加备案信息

在.\node_modules\hexo-theme-icarus\layout\common 下 footer.jsx 文件
写入：

<p className="is-size-7" style={{ display: 'flex', justifyContent: 'center', alignItems: 'center', flexWrap: 'wrap' }}>
    {/* ICP备案 */}
    <a href="https://beian.miit.gov.cn/" target="_blank" rel="noopener">苏ICP备XXXXXX号-1</a>
    
    {/* 两个备案号之间的间隔 */}
    <span style={{ margin: '0 10px' }}>|</span>

    {/* 公安备案 */}
    <a href="https://www.beian.gov.cn/portal/registerSystemInfo?recordcode=XXXXXXX" target="_blank" rel="noopener" style={{ display: 'inline-flex', alignItems: 'center' }}>
        <img src="XXXXX" style={{ height: '1em', marginRight: '5px' }} />
        <span> XXXXXXXXX号</span>
    </a>
</p>

添加备案信息

设置图片居中

参考Hexo博客主题之Icarus的设置与美化（进阶）的方法，打开Icarus主题文件夹，通过相应路径source/js/main.js，找到main.js修改如下。

(function ($) {
    $('.article img:not(".not-gallery-item")').each(function () {
        // wrap images with link and add caption if possible
        if ($(this).parent('a').length === 0) {
//修改部分
            $(this).wrap('<a class="gallery-item" style="display:block;text-align:center;" href="' + $(this).attr('src') + '"></a>');
//修改部分
            if (this.alt) {
                $(this).after('<div class="has-text-centered is-size-6 has-text-grey caption">' + this.alt + '</div>');
            }
        }
    });

.bat

@echo off
REM Change directory to the folder where you want to run "npx mark-magic"
cd "C:\Users\XXXX\mark-magic"
call npx mark-magic

REM Change directory to the folder for "hexo g" and "hexo d"
cd "C:\Users\XXXX\blog"

call hexo clean
call hexo g
call hexo d

REM End of script
echo Script completed! Waiting for 3 seconds...
timeout /t 3 /nobreak >nul

start www.tsingloo.com

REM Close the window
exit

Posted 2023-09-21Updated 2023-09-25随笔 / 随想a minute read (About 156 words)

地铁 · 长望

昏黄的灯光
摇曳的车厢
去向哪呢
熟悉又陌生的地方

再推开门时
是否又都变样
谁知道呢
热烈的太阳也进入他的梦乡
亦有人在星夜下迷失方向

最后一班列车割破踉踉跄跄的夜幕轻唱
只留下疲倦的回响
踱过无尽的回廊
漾起回忆的细浪

旅人盼望着他的床
在前方在高岗在昨夜的故乡
明天又将是怎样的一副景象
一切留给时间酝酿

初露锋芒
不如去他乡遗忘
去漂泊去流浪

Physical Architecture

Warp

Example Ⅰ

Example Ⅱ

Divergence

Synchronization

Golden Role

Deadlock

Revisit of Matrix Multiply

Phase 1: Processing the First Pair of Tiles

Load Tiles into Shared Memory

__syncthreads();

Compute from Shared Memory

__syncthreads();

Phase 2: Processing the Second Pair of Tiles

Load Tiles into Shared Memory

__syncthreads();

Compute from Shared Memory

__syncthreads();

为什么我们可以这么做？

CUDA 函数执行空间限定符

__global__

__device__

for all device code

__host__

组合用法

Grid, Block, Thread

Thread

Treading Model

Thread Hierarchies

Block

Grid

Subdivision of Bezier Curve

Derivatives of Bezier Curves

Why Spline？

三种切线

Continuity

如何构造出 $C^1$ 呢？Mirror 意味着怎么样的关系呢？

$C^2$ 连续性？

Geometric Continuity

$G^{1}$

Curvature $k$

$G^{2}$ 连续性

Hermite Curves

Cardinal Spline

Catmull-Rom Spline

B-Spline(Basis-Spline)

B 样条的递推定义

de Boor-Cox 递推定义，只与 t 有关

例

Trajectory

Conclusion

Handling Non-Uniform Time in Splines

1. 核心变化：全局时间 (T) vs 局部参数 (u)

2. 关键：使用链式法则 (Chain Rule) 重新定义连续性

一阶导数 (速度)

二阶导数 (加速度)

3. 如何修正上述的Splines？

A. 贝塞尔曲线的 $C^1$ 连续性

均匀情况下的 Catmull-Rom

问题：为什么在非均匀时间下会失效？

解决方案：基于“弦速度”的加权平均

均匀情况下的 B-Spline

2. 解决方案：非均匀节点向量 (Non-Uniform Knot Vector)

总结

Spring - Damper

Aerodynamic Forces

Simulation

General Steps

Phase Space

Integration

🔁 Backwards Euler:

$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} =\begin{bmatrix}\frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \\frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y}\end{bmatrix}

Adams Bashforth

Verlet Integration

坐标系

基础

逆和共轭

旋转

四元数， 轴-角， 旋转矩阵， 欧拉角转换

`__syncthreads();`

`__syncthreads();`

`__syncthreads();`

`__syncthreads();`

`global`

`device`

`host`

$$
\frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \
\frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y}
\end{bmatrix}

四元数，轴-角，旋转矩阵，欧拉角转换