真的是很久没有写过杂记了啊。最近工作中突然想到两个小问题，记录一下。

# 法线问题

## 1. 像素(片元)着色器计算光照的空间

片段着色器里的计算都是在世界空间坐标中进行的。

是不是应该把法向量也转换为世界空间坐标？基本正确，但是这不是简单地把它乘以一个模型矩阵就能搞定的。

1. 首先，法向量只是一个方向向量，不能表达空间中的特定位置。同时，法向量没有齐次坐标（顶点位置中的w分量）。这意味着，位移不应该影响到法向量。因此，如果我们打算把法向量乘以一个模型矩阵，我们就要从矩阵中移除位移部分，只选用模型矩阵左上角3×3的矩阵（注意，我们也可以把法向量的w分量设置为0，再乘以4×4矩阵；这同样可以移除位移）。对于法向量，我们只希望对它实施缩放和旋转变换。
2. 其次，如果模型矩阵执行了不等比缩放，顶点的改变会导致法向量不再垂直于表面了。因此，我们不能用这样的模型矩阵来变换法向量

每当我们应用一个不等比缩放时（注意：等比缩放不会破坏法线，因为法线的方向没被改变，仅仅改变了法线的长度，而这很容易通过标准化来修复），法向量就不会再垂直于对应的表面了，这样光照就会被破坏。

修复这个行为的诀窍是使用一个为法向量专门定制的模型矩阵。这个矩阵称之为法线矩阵(Normal Matrix)，它使用了一些线性代数的操作来移除对法向量错误缩放的影响。

法线矩阵被定义为「模型矩阵左上角的逆矩阵的转置矩阵」。真是拗口，如果你不明白这是什么意思，别担心，我们还没有讨论逆矩阵(Inverse Matrix)和转置矩阵(Transpose Matrix)。注意，大部分的资源都会将法线矩阵定义为应用到模型-观察矩阵(Model-view Matrix)上的操作，但是由于我们只在世界空间中进行操作（不是在观察空间），我们只使用模型矩阵。

在顶点着色器中，我们可以使用inversetranspose函数自己生成这个法线矩阵，这两个函数对所有类型矩阵都有效。注意我们还要把被处理过的矩阵强制转换为3×3矩阵，来保证它失去了位移属性以及能够乘以vec3的法向量。

Normal = mat3(transpose(inverse(model))) * aNormal;


## 2. 法线贴图为什么偏蓝

由于法线向量是个几何工具，而纹理通常只用于储存颜色信息，用纹理储存法线向量不是非常直接。如果你想一想，就会知道纹理中的颜色向量用r、g、b元素代表一个3D向量。类似的我们也可以将法线向量的x、y、z元素储存到纹理中，代替颜色的r、g、b元素。法线向量的范围在-1到1之间，所以我们先要将其映射到0到1的范围：

vec3 rgb_normal = normal * 0.5 + 0.5; // 从 [-1,1] 转换至 [0,1]


这会是一种偏蓝色调的纹理（你在网上找到的几乎所有法线贴图都是这样的）。这是因为所有法线的指向都偏向z轴（0, 0, 1）这是一种偏蓝的颜色。法线向量从z轴方向也向其他方向轻微偏移，颜色也就发生了轻微变化，这样看起来便有了一种深度。

## 3. 法线贴图像素(片元)着色器计算光照的空间

法线贴图中的法线向量在切线空间中，法线永远指着正z方向。切线空间是位于三角形表面之上的空间：法线相对于单个三角形的本地参考框架。它就像法线贴图向量的本地空间；它们都被定义为指向正z方向，无论最终变换到什么方向。使用一个特定的矩阵我们就能将本地/切线空寂中的法线向量转成世界或视图坐标，使它们转向到最终的贴图表面的方向。

这种矩阵叫做TBN矩阵这三个字母分别代表tangent、bitangent和normal向量。这是建构这个矩阵所需的向量。要建构这样一个把切线空间转变为不同空间的变异矩阵，我们需要三个相互垂直的向量，它们沿一个表面的法线贴图对齐于：上、右、前。

在GPU中，存在3种分支：

1. Static branching. The condition is based off of compile-time constants; as such, you know from looking at the code and know which branches will be taken. Pretty much any compiler handles this as part of basic optimization.
2. Statically uniform branching. The condition is based off of expressions involving only uniforms or constants. You can’t know a priori which branch will be taken, but the compiler can statically be certain that wavefronts will never be broken by this if.
3. Dynamic branching. The condition is based off of expressions that involve more than just constants and uniforms. Here, a compiler cannot tell a priori if a wavefront will be broken up or not. Whether that happens depends on the runtime evaluation of the condition expression.

而这几种分支在不同图形API下性能情况如下：

Desktop, Pre-D3D10

There, it’s kinda the wild west. NVIDIA’s compiler for such hardware was notorious for detecting such conditions and actually recompiling your shader whenever you changed uniforms that affected such conditions.

In general, this era is where about 80% of the “never use if statements” comes from. But even here, it’s not necessarily true.

You can expect optimization of static branching. You can hope that statically uniform branching won’t cause any additional slowdown (though the fact that NVIDIA thought recompilation would be faster than executing it makes it unlikely at least for their hardware). But dynamic branching is going to cost you something, even if all of the invocations take the same branch.

Compilers of this era do their best to optimize shaders so that simple conditions can be executed simply. For example, your output = inputenable + input2(1-enable); is something that a decent compiler could generate from your equivalent if statement.

Desktop, Post-D3D10

Hardware of this era is generally capable of handling statically uniform branches statements with little slowdown. For dynamic branching, you may or may not encounter slowdown.

Desktop, D3D11+

Hardware of this era is pretty much guaranteed to be able to handle dynamically uniform conditions with little performance issues. Indeed, it doesn’t even have to be dynamically uniform; so long as all of the invocations within the same wavefront take the same path, you won’t see any significant performance loss.

Note that some hardware from the previous epoch probably could do this as well. But this is the one where it’s almost certain to be true.

Mobile, ES 2.0

Welcome back to the wild west. Though unlike Pre-D3D10 desktop, this is mainly due to the huge variance of ES 2.0-caliber hardware. There’s such a huge amount of stuff that can handle ES 2.0, and they all work very differently from each other.

Static branching will likely be optimized. But whether you get good performance from statically uniform branching is very hardware-dependent.

Mobile, ES 3.0+

Hardware here is rather more mature and capable than ES 2.0. As such, you can expect statically uniform branches to execute reasonably well. And some hardware can probably handle dynamic branches the way modern desktop hardware does.