深海游弋的鱼 – 默默的点滴

Learning Large Models from Scratch (4): The "Internal Gears" of the Transformer, or How FFN, Residual Connections, and Normalization Make AI Smarter

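If we think of the Transformer as a precision machine, the attention mechanism is its "core engine", while the feed-forward network (FFN), residual connections, and normalization are the "internal gears" that keep the engine running efficiently. These modules look simple, yet they address two core difficulties of deep learning: limited feature-extraction capacity and unstable training. They are a key part of what lets a large language model "understand language and generate text".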

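Feed-Forward Network (FFN): "refining" the attention output

Attention captures relations between words (for example, that "it" refers to "the dog"), but the vectors it outputs still need further processing before the model can use them well. The FFN applies a non-linear transformation to the attention output and refines its features, much like a chef turning fresh ingredients (the attention result) into a finished dish (usable features).

Core structure of the FFN: two linear transformations plus an activation function

The FFN in a Transformer is very simple and usually consists of two steps.

The first step is a linear transformation (Linear1) that projects the input vector from the model dimension up to a higher dimension (for example, from 512 to 2048). This "expands the feature space", like looking at an object through a higher-resolution lens that reveals more detail (a "dog" is not only an "animal" but also a "mammal", a "pet", and so on). An activation function such as ReLU then introduces non-linearity. A purely linear map can only learn simple relations (such as "dog → animal"), whereas a non-linear one can learn more complex associations (such as "dog → pet → needs feeding").

The second step is another linear transformation (Linear2) that projects the high-dimensional vector back to the original dimension (for example, from 2048 down to 512). This is "feature aggregation": the expanded detail features are recombined into a more compact representation.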
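The two steps described above can be sketched in C as follows. This is a minimal illustration, not code from any particular model: the function name `ffn_forward`, the bias terms, the row-major weight layout, and the caller-provided scratch buffer `hidden` are all assumptions made for the example.

```c
#include <assert.h>

/* Sketch of y = W2 * relu(W1 * x + b1) + b2.
 * Assumed layout: w1 is (d_ff x d_model) and w2 is (d_model x d_ff), row-major.
 * hidden is scratch space of length d_ff supplied by the caller. */
void ffn_forward(float *y, const float *x,
                 const float *w1, const float *b1,
                 const float *w2, const float *b2,
                 float *hidden, int d_model, int d_ff) {
    assert(d_model > 0 && d_ff > 0);
    /* Linear1 + ReLU: expand d_model -> d_ff */
    for (int i = 0; i < d_ff; i++) {
        float v = b1[i];
        for (int j = 0; j < d_model; j++) {
            v += w1[i * d_model + j] * x[j];
        }
        hidden[i] = v > 0.0f ? v : 0.0f;
    }
    /* Linear2: project d_ff -> d_model */
    for (int i = 0; i < d_model; i++) {
        float v = b2[i];
        for (int j = 0; j < d_ff; j++) {
            v += w2[i * d_ff + j] * hidden[j];
        }
        y[i] = v;
    }
}
```

In a real model `d_ff` is typically several times `d_model` (512 and 2048 in the example above), and the same function is applied to every token position independently.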

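Take "The cat chases the dog; it runs fast" as an example. Attention has already computed the relation between "it" and "the dog" and outputs a vector carrying that relation. The FFN expands the features through the first linear transformation (for example, the dog's "ability to run" or its "being chased" state), uses the activation function to emphasize the important ones (such as "ability to run"), and finally compresses the result into a more useful vector.

Why is the FFN the "best partner" of attention?

Attention is good at "capturing relations" but weak at "transforming features": its output is essentially a relation-weighted sum, which is a fairly coarse representation. The FFN's strength is exactly "refining features". It adds non-linearity, so the model can learn complex semantics (such as metaphor or logical reasoning). It focuses on key features: expanding and then compressing the dimension strengthens important features (such as the link between "run" and "dog") and weakens noise. And it works per position: attention mixes information across positions, while the FFN is applied to each position independently and refines what that position has gathered (for example, the information about "fast" that attention has already mixed into the vector for "run"). Put figuratively, attention is the "scout" (it finds relevant information) and the FFN is the "analyst" (it distills useful information).

Activation functions: giving the FFN its "non-linear ability"

The activation function is the "soul" of the FFN. Without it, the FFN degenerates into a linear map (two stacked linear transformations are equivalent to one) and cannot learn complex features. ReLU (Rectified Linear Unit) is the choice of the original Transformer paper: ReLU(x) = max(0, x), so negative inputs give 0 and positive inputs pass through unchanged. It is cheap to compute and mitigates the vanishing gradients caused by saturating activations such as Sigmoid, but it suffers from the "dying ReLU" problem: a neuron whose pre-activation stays negative receives zero gradient and may stop updating.

GELU (Gaussian Error Linear Unit) is the refinement used by BERT, GPT, and similar models. It is commonly approximated as 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))). It is smoother than ReLU (it does not cut off abruptly at 0) and keeps more intermediate information (for example, different intensities of "run" yield slightly different outputs), which suits models that need fine-grained features (text understanding in BERT, generation in GPT).

SwiGLU (Swish-Gated Linear Unit) is the mainstream choice in recent large models such as LLaMA. It has the form SwiGLU(x) = Swish(xW) ⊗ (xV), where Swish(z) = z · sigmoid(βz) (with β = 1 this is SiLU), W and V are two separate linear projections, and ⊗ is element-wise multiplication. One branch acts as a "gate" on the other, dynamically selecting features ("activating" useful ones and "suppressing" irrelevant ones), which makes it more flexible than GELU. It is generally reported to improve quality in large language models, although the size of the gain depends on the model and the task.

As a rough tendency, the larger the model, the more it benefits from a flexible activation: ReLU is efficient enough for small models, while large models tend to use the finer control of SwiGLU.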

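Residual connections: the "bridge design" that lets a model go deep without collapsing

In deep learning, depth (the number of layers) is a key lever for performance. But once a conventional network passes a certain depth, it runs into "vanishing gradients" (parameters barely update during training) and "degradation" (accuracy drops as layers are added). Residual connections, introduced with ResNet, largely remove this obstacle and allow Transformers to stack dozens or even hundreds of layers.

Core idea: a "skip connection" that carries the original information

The structure of a residual connection is extremely simple: add the module's input to its output. In the attention module, for instance, the output equals the attention result plus the original input. An analogy helps. In a conventional network, information moves like a relay race: every layer must pass it on perfectly, or the later layers are cut off. With residual connections, information moves on a two-lane road: one lane goes through the module (such as attention), and the other carries the original information straight through. Even if the module loses something, the original information still reaches the deeper layers via the direct lane.

Why do residual connections help with vanishing gradients?

When training a model, parameter updates depend on gradients (derivatives of the loss with respect to the parameters).

In a conventional network, the gradient has to pass through every layer, and the more layers there are, the more it decays (like a sound fading in a long pipe). A residual connection gives the gradient a shortcut. With output = F(x) + x, the gradient of the loss with respect to the input x equals the gradient with respect to the output multiplied by (1 + ∂F/∂x). The "1" comes from the identity path, so part of the gradient flows back unchanged regardless of how small ∂F/∂x is, which makes vanishing far less likely and lets deep stacks update effectively. For example, when training a 100-layer Transformer without residual connections, the gradient that reaches the earliest layers may decay to nearly 0, so their parameters hardly update. With residual connections, the gradient can travel along the "output + input" path all the way back to layer 1, and all layers update normally.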
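The shortcut argument can be written out as a short derivation. This is a sketch in generic notation (F for the module, L for the loss, I for the identity matrix), not tied to any particular model. For a single residual block:

$$ y = x + F(x) \quad\Longrightarrow\quad \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(I + \frac{\partial F}{\partial x}\right) $$

For a stack of N such blocks with $x_{k+1} = x_k + F_k(x_k)$:

$$ \frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_N}\prod_{k=0}^{N-1}\left(I + \frac{\partial F_k}{\partial x_k}\right) $$

Expanding the product always yields the term $\frac{\partial L}{\partial x_N}\cdot I$, which reaches $x_0$ without passing through any module, whereas in a plain stack the corresponding product $\prod_k \frac{\partial F_k}{\partial x_k}$ can shrink towards zero as N grows.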

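Normalization: the "calibration tool" that keeps training steady

In deep learning, the numerical range of activations can fluctuate wildly (some vectors have values in 0 to 1, others in 100 to 200). This instability makes training oscillate (the loss jumps up and down) or even fail to converge. Normalization standardizes vectors to a fixed scale (for example, mean 0 and variance 1), like "calibrating" the data so that the model always sees inputs in the expected range.

The most common normalization in Transformers is layer normalization (Layer Norm, LN), but there are other variants such as BN (Batch Norm) and RMSNorm. Understanding how they differ explains why LN became the standard choice in NLP.

LN and BN: two approaches to normalization

LN and BN share the same goal (standardizing values) but differ in what they normalize over. Layer normalization (LN) normalizes across all features of a single sample, which in a Transformer means the feature vector of one token (for example, its 512 dimensions): for each token, it computes the mean and variance of that token's own features. Batch normalization (BN) normalizes each feature dimension across all samples in a batch (for example, the same dimension across 32 sentences): for each feature dimension, it computes the mean and variance over the batch.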

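Why LN for text and BN for images? Text has poor "batch consistency": sentences in the same batch differ greatly in length and meaning (some are news, some are poetry), so BN's batch statistics carry little meaning, whereas LN normalizes each sample on its own and is unaffected by the batch. Images have strong "feature consistency": images in the same batch (say, pictures of cats) tend to have similar value distributions in the same feature channel (for example, edge features), and BN can exploit that consistency.

In the original Transformer, LN directly follows the residual connection, forming a "residual then normalize" combination (the output equals LN(attention output + input)). This combination standardizes the values while the residual path preserves the original information.

Pre-Norm and Post-Norm: choosing when to normalize

Within a Transformer layer, normalization can be placed before the module (attention or FFN), which is Pre-Norm, or after it, which is Post-Norm. The two designs differ considerably in training stability. Post-Norm is the choice of the original Transformer: the module computation and the residual addition come first, then normalization. The drawback is that the module computation may produce large swings in value (attention dot products can be large), and normalizing only after the residual addition can still leave training unstable, especially in deep models.

Pre-Norm is the choice of modern large models such as GPT-2/GPT-3 and LLaMA: the input is normalized first, then passed through the module, and the residual is added afterwards. Its advantage is that the module always sees a normalized input, so its computation is less prone to numerical blow-up, training is more stable, and deeper stacks (100 layers or more) become feasible. As a rough rule of thumb rather than a hard threshold, Post-Norm behaves well at around 12 layers but its loss tends to oscillate beyond roughly 24 layers unless warmup and learning rate are tuned carefully, while Pre-Norm can keep the loss decreasing smoothly even at 100 layers. This is the main reason large models generally adopt Pre-Norm.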

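"Lightweight" normalization variants: RMSNorm and ScaleNorm

LN is stable, but computing both mean and variance has a cost, so researchers have proposed cheaper variants. RMSNorm (Root Mean Square Layer Normalization) is used by LLaMA and similar models (GPT-3, by contrast, keeps standard LayerNorm). It drops the mean computation and normalizes by the root mean square only. This reduces the work per normalization (no mean subtraction), with the exact saving depending on the implementation, and in language models its quality is close to LN. The underlying argument is that re-centering contributes little to LN's benefit, so removing the mean has only a small effect on the result.

ScaleNorm simplifies further: it normalizes by the vector's L2 norm and a single learned scale, which is even cheaper (no per-feature statistics) and suits resource-constrained settings. However, it is more sensitive to the input distribution and tends to work better in smaller models.

The common idea behind these variants is to cut computation while keeping training stable. For large models, an efficiency gain in every layer adds up to a significant advantage.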

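How the modules work together: the Transformer's "assembly line"

FFN, residual connections, and normalization do not exist in isolation. Within a Transformer layer they form an "assembly line" that processes features together.

Taking one Pre-Norm layer as an example, the full flow is as follows. The layer receives the feature vector output by the previous layer. It applies pre-normalization to obtain a standardized input (normalize first, so the input is stable). The multi-head attention module then computes the attention output (attention captures relations). A residual connection adds the attention output to the original input (preserving the original information and avoiding feature loss). Pre-normalization is applied again to give the FFN a stable input. The FFN refines the features with a SwiGLU activation and linear transformations. A final residual connection produces the output, which combines the attention and FFN features.

The elegance of this flow is that normalization keeps the input to each step stable and avoids numerical swings, residual connections give information a "fallback route" so that it propagates effectively even in deep stacks, and the FFN keeps refining features on that stable foundation. It is like a factory line: normalization is "quality calibration", the residual connection is the "backup channel", and the FFN is "precision machining". Together they let the Transformer learn the regularities of language stably and efficiently.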

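Module choices across models: balancing efficiency and performance

A model's choices for FFN, residual layout, and normalization reflect a balance among task requirements, model size, and compute budget. Large decoder-only models in the LLaMA style use SwiGLU as the FFN activation, RMSNorm for normalization, and a Pre-Norm layout: large models need fine-grained features and stability, SwiGLU improves expressiveness, RMSNorm is cheap, and Pre-Norm supports deep stacks. GPT-4 is often assumed to follow a similar recipe, but its architecture has not been published.

LLaMA 2 and other open models make the same choices (SwiGLU, RMSNorm, Pre-Norm): open models need to balance quality and efficiency, and RMSNorm reduces computation, which helps deployment. BERT and other understanding-oriented models use the GELU activation with LN in the Post-Norm layout of the original Transformer: understanding tasks benefit from smooth features, GELU is finer-grained than ReLU, and LN is stable enough at BERT's depth.

Lightweight models such as MobileBERT use ReLU as the activation and replace LayerNorm with a cheaper element-wise linear transform (NoNorm): mobile deployment needs maximum efficiency, and these operations are among the cheapest available.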

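Closing: the "deep learning philosophy" that details determine performance

FFN, residual connections, and normalization may look like "auxiliary components", yet they determine how deep a Transformer can go and how fast it can run. Their evolution reflects a core lesson of deep learning: the capability of large models comes not only from "scale" (parameters and data) but also from "detail design", that is, how to make every layer more stable and every computation more effective.

From ReLU to SwiGLU, from Post-Norm to Pre-Norm, from LN to RMSNorm, these small improvements add up, taking models from "trainable at 12 layers" to "trainable at 100 layers", and from "generating stiff text" to "writing fluent prose". As models keep growing, optimizing these "internal gears" will remain important. What supports hundreds of billions of parameters is not only a grand architecture but every carefully engineered detail.

When we marvel at an AI's language ability, it is worth remembering that what makes it "smart" is not only the "focus" of attention, but also the quiet "processing, passing on, and calibrating" done by these modules behind the scenes.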

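Reading the llama2.c source code

1. Overview

Andrej Karpathy, the well-known former OpenAI engineer, open-sourced the llama2.c project, a C implementation of inference for the Llama 2 model. It implements the Llama 2 inference algorithm in roughly 970 lines of C. The code is concise and efficient, well worth reading closely, and very helpful for mastering the details of large-model inference.

2. Reading the source

2.1 Basic algorithms

The RMS normalization formula is:

$$ o_i = w_i \times x_i \times \frac {1}{\sqrt{\frac{1}{n}\sum_{j=0}^{n-1} x_j^2 + \epsilon}} $$

Here $\epsilon$ is a small constant that prevents division by zero. The RMS factor normalizes x, and w is the gain vector that rescales the normalized input.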

// ----------------------------------------------------------------------------
// neural net blocks; the dynamics of the Transformer
void rmsnorm(float* o, float* x, float* weight, int size) {
    // calculate sum of squares
    float ss = 0.0f;
    for (int j = 0; j < size; j++) {
        ss += x[j] * x[j];
    }
    ss /= size;
    ss += 1e-5f;
    ss = 1.0f / sqrtf(ss);
    // normalize and scale
    for (int j = 0; j < size; j++) {
        o[j] = weight[j] * (ss * x[j]);
    }
}

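The softmax formula is:

$$ o_i = \frac {e^{x_i-x_{max}}}{\sum_{j=0}^{n-1} e^{x_j-x_{max}}} $$

The code is shown below. As the comment says, subtracting the maximum prevents numerical overflow and makes the computation more stable. A simple algebraic manipulation shows that the final result is unchanged.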

void softmax(float* x, int size) {
    // find max value (for numerical stability)
    float max_val = x[0];
    for (int i = 1; i < size; i++) {
        if (x[i] > max_val) {
            max_val = x[i];
        }
    }
    // exp and sum
    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        x[i] = expf(x[i] - max_val);
        sum += x[i];
    }
    // normalize
    for (int i = 0; i < size; i++) {
        x[i] /= sum;
    }
}
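The "simple algebraic manipulation" mentioned above can be spelled out. For any constant c (the code uses $c = x_{max}$), the common factor $e^{-c}$ cancels between numerator and denominator:

$$ \frac{e^{x_i - c}}{\sum_{j=0}^{n-1} e^{x_j - c}} = \frac{e^{-c}\, e^{x_i}}{e^{-c} \sum_{j=0}^{n-1} e^{x_j}} = \frac{e^{x_i}}{\sum_{j=0}^{n-1} e^{x_j}} $$

With $c = x_{max}$ every exponent is at most 0, so each `expf` result lies in (0, 1] and cannot overflow, and the term for the maximum element is exactly 1, so the sum is never zero.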

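The matrix-vector product W (d,n) @ x (n,) -> xout (d,) uses the naive algorithm: the outer loop runs over rows and the inner loop over columns. The code is as follows: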

void matmul(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    // by far the most amount of time is spent inside this little function
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}

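2.2 The forward computation

The computation of one attention block in the model is shown in the figure below:

The project computes Q, K, and V one token at a time. The parameter dim is the Transformer's vector dimension, and l is the layer index.

The first step is rmsnorm, that is, normalization. The input is x (d,), the RMS weight vector is w->rms_att_weight + l*dim, and the result is written to s->xb (d,).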

// attention rmsnorm
rmsnorm(s->xb, x, w->rms_att_weight + l*dim, dim);

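The second step is the Q, K, V matrix multiplications. Note the distinction between kv_dim and dim, which lets the same code support both multi-head attention (MHA) and grouped-query attention (GQA), as shown in the figure below:

kv_dim is the total dimension of the keys and values, and dim is the total vector dimension of the Transformer. In multi-head attention, kv_dim = dim. In grouped-query attention, kv_dim = dim * n_kv_heads / n_heads. In the figure's example, n_kv_heads = 4 and n_heads = 8, so kv_dim = dim / 2.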
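The relationships among these sizes can be checked with a few lines of C. The struct `AttnShape` and the helper names below are hypothetical and exist only for this sketch; the formulas follow the definitions above, and `kv_head_offset` mirrors the `(h / kv_mul) * head_size` indexing that the attention loop later in this walkthrough uses to find the K/V slice for query head h.

```c
#include <assert.h>

/* Hypothetical container for the sizes discussed in the text. */
typedef struct {
    int dim;        /* model dimension */
    int n_heads;    /* number of query heads */
    int n_kv_heads; /* number of key/value heads (== n_heads for MHA) */
} AttnShape;

int head_size_of(AttnShape s) { return s.dim / s.n_heads; }

/* kv_dim = dim * n_kv_heads / n_heads */
int kv_dim_of(AttnShape s) { return (s.dim * s.n_kv_heads) / s.n_heads; }

/* Number of query heads that share one K/V head. */
int kv_mul_of(AttnShape s) { return s.n_heads / s.n_kv_heads; }

/* Offset, within one cached position, of the K (or V) slice used by
 * query head h. */
int kv_head_offset(AttnShape s, int h) {
    assert(h >= 0 && h < s.n_heads);
    return (h / kv_mul_of(s)) * head_size_of(s);
}
```

With dim = 512, n_heads = 8, and n_kv_heads = 4, this gives head_size = 64, kv_dim = 256, and kv_mul = 2, so query heads 0 and 1 read the K/V slice at offset 0, heads 2 and 3 the slice at offset 64, and so on.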

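For the dimensions of each matrix and how they relate under MHA, GQA, and similar schemes, see the figure below:

The detailed code for computing the Q, K, and V vectors follows: Wq(d,d) @ xb(d,) -> q(d,), Wk(dkv,d) @ xb(d,) -> k(dkv,), Wv(dkv,d) @ xb(d,) -> v(dkv,)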

// key and value point to the kv cache
int loff = l * p->seq_len * kv_dim; // kv cache layer offset for convenience
s->k = s->key_cache + loff + pos * kv_dim;
s->v = s->value_cache + loff + pos * kv_dim;

// qkv matmuls for this position
matmul(s->q, s->xb, w->wq + l*dim*dim, dim, dim);
matmul(s->k, s->xb, w->wk + l*dim*kv_dim, dim, kv_dim);
matmul(s->v, s->xb, w->wv + l*dim*kv_dim, dim, kv_dim);

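Next, RoPE positional encoding is applied to the Q and K vectors according to the formula below, where m is the index pos of the current token, hs is the head size, and i indexes the pairs of adjacent elements within a head. The primed values are the results; both elements of a pair are computed from the old values. Note that the Llama model applies this encoding to the Q and K vectors in every layer.

$$ \begin{aligned} \theta_i &= \frac{1}{10000^{2i/hs}}= 10000^{-2i/hs}, \quad i = 0, 1, \dots, hs/2 - 1 \\ Q'(2i) &=Q(2i)\cos (m\theta_i) - Q(2i+1)\sin(m\theta_i)\\ Q'(2i+1) &=Q(2i)\sin (m \theta_i) + Q(2i+1)\cos(m\theta_i)\\ K'(2i) &=K(2i)\cos (m \theta_i) - K(2i+1)\sin(m\theta_i)\\ K'(2i+1) &=K(2i)\sin (m \theta_i) + K(2i+1)\cos(m\theta_i)\\ \end{aligned} $$

The detailed code follows. In the code, the loop variable i is the element index and advances by 2, so head_dim = i % head_size plays the role of 2i in the formula. Note that in GQA the K vector is shorter than the Q vector, so for i < kv_dim both Q and K are rotated, while for i >= kv_dim only Q is rotated.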

// RoPE relative positional encoding: complex-valued rotate q and k in each head
for (int i = 0; i < dim; i+=2) {
    int head_dim = i % head_size;
    float freq = 1.0f / powf(10000.0f, head_dim / (float)head_size);
    float val = pos * freq;
    float fcr = cosf(val);
    float fci = sinf(val);
    int rotn = i < kv_dim ? 2 : 1; // how many vectors? 2 = q & k, 1 = q only
    for (int v = 0; v < rotn; v++) {
        float* vec = v == 0 ? s->q : s->k; // the vector to rotate (query or key)
        float v0 = vec[i];
        float v1 = vec[i+1];
        vec[i]   = v0 * fcr - v1 * fci;
        vec[i+1] = v0 * fci + v1 * fcr;
    }
}

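Next, attention is computed for each head. The formula is:

$$ \mathrm{Attention}(Q_i, K, V) = \mathrm{softmax}\left(\frac{ Q_i K^T}{\sqrt{d}}\right)V , \quad Q_i \in \mathbb{R}^{1 \times d},K \in \mathbb{R}^{n\times d},V\in\mathbb{R}^{n\times d} $$

Here d is the head size and n is the number of positions seen so far. Strictly speaking, the softmax output is the vector of attention weights (scores), and its product with V is the attention output.

In the actual computation, the code iterates over the heads. Within each head it first computes the dot product of Qi with each row of K, divides by sqrt(d) to obtain the vector att (1,n), and then applies softmax to get the attention scores.

In GQA, a group of query heads shares one K head and one V head, so conceptually the shared K and V have to be "expanded" so that every query head has a matching (n,d) matrix. The code does this by indexing rather than copying: it selects the K/V head with h / kv_mul, so that kv_mul consecutive query heads share the same K and V vectors.

Then the product of the attention scores (1,n) and V (n,d) is computed, giving xb (1,d). This is not done as a textbook vector-matrix multiplication. Instead, the attention score of each position is multiplied by the corresponding row of V and accumulated into xb. The benefit is that each row of V is read contiguously, which is friendlier to the CPU cache; it is a common way to organize such a product.

For each head, the attention computation for each token is visualized in the figure below:

The figure shows clearly that each token computes its relevance to the other tokens and then takes a weighted sum to obtain the final attention output.

The code is as follows: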

for (h = 0; h < p->n_heads; h++) {
    // get the query vector for this head
    float* q = s->q + h * head_size;
    // attention scores for this head
    float* att = s->att + h * p->seq_len;
    // iterate over all timesteps, including the current one
    for (int t = 0; t <= pos; t++) {
        // get the key vector for this head and at this timestep
        float* k = s->key_cache + loff + t * kv_dim + (h / kv_mul) * head_size;
        // calculate the attention score as the dot product of q and k
        float score = 0.0f;
        for (int i = 0; i < head_size; i++) {
            score += q[i] * k[i];
        }
        score /= sqrtf(head_size);
        // save the score to the attention buffer
        att[t] = score;
    }

    // softmax the scores to get attention weights, from 0..pos inclusively
    softmax(att, pos + 1);

    // weighted sum of the values, store back into xb
    float* xb = s->xb + h * head_size;
    memset(xb, 0, head_size * sizeof(float));
    for (int t = 0; t <= pos; t++) {
        // get the value vector for this head and at this timestep
        float* v = s->value_cache + loff + t * kv_dim + (h / kv_mul) * head_size;
        // get the attention weight for this timestep
        float a = att[t];
        // accumulate the weighted value into xb
        for (int i = 0; i < head_size; i++) {
            xb[i] += a * v[i];
        }
    }
}

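The code also shows why the K and V matrices need to be cached. For the token at a given position, only one row of Q (the current position) takes part in the computation, whereas every row of K and V up to that position is needed. The result is the (1,d) attention output vector for that position. Caching K and V therefore avoids recomputing them for earlier positions at every step.

The remaining computation is straightforward, and the comments are clear. The steps are:

Compute Wo (d,d) @ xb^T (d,) to obtain xb2 (d,)
Apply the residual connection by adding to the x (d,) vector: x += xb2
Pass x through another RMSNorm(x) to obtain xb (d,)
Compute hb and hb2: W1(hd, d) @ xb (d,) -> hb (hd,), W3(hd, d) @ xb (d,) -> hb2 (hd,)
Apply the SiLU non-linearity to hb: $silu(hb) = hb \cdot \frac{1}{1 + e^{-hb}}$
Compute the element-wise product hb * hb2 to obtain hb (hd,)
Compute W2(d, hd) @ hb (hd,) -> xb (d,)
Finally, apply the residual connection again by adding the xb vector: x += xb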

// final matmul to get the output of the attention
matmul(s->xb2, s->xb, w->wo + l*dim*dim, dim, dim);

// residual connection back into x
for (int i = 0; i < dim; i++) {
    x[i] += s->xb2[i];
}

// ffn rmsnorm
rmsnorm(s->xb, x, w->rms_ffn_weight + l*dim, dim);

// Now for FFN in PyTorch we have: self.w2(F.silu(self.w1(x)) * self.w3(x))
// first calculate self.w1(x) and self.w3(x)
matmul(s->hb, s->xb, w->w1 + l*dim*hidden_dim, dim, hidden_dim);
matmul(s->hb2, s->xb, w->w3 + l*dim*hidden_dim, dim, hidden_dim);

// SwiGLU non-linearity
for (int i = 0; i < hidden_dim; i++) {
    float val = s->hb[i];
    // silu(x)=x*σ(x), where σ(x) is the logistic sigmoid
    val *= (1.0f / (1.0f + expf(-val)));
    // elementwise multiply with w3(x)
    val *= s->hb2[i];
    s->hb[i] = val;
}

// final matmul to get the output of the ffn
matmul(s->xb, s->hb, w->w2 + l*dim*hidden_dim, hidden_dim, dim);

// residual connection
for (int i = 0; i < dim; i++) {
    x[i] += s->xb[i];
}

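The same computation is repeated for every layer: each layer takes x as input and writes its output back to x. After all layers have been processed, the final steps are:

RMSNorm(x), which normalizes the x vector.
Compute Wc(dvoc, d) @ x (d,) -> logits (dvoc,), where dvoc is the vocabulary size.

The resulting logits are the unnormalized scores of each vocabulary token for the next position; applying softmax turns them into probabilities.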

// final rmsnorm
rmsnorm(x, x, w->rms_final_weight, dim);

// classifier into logits
matmul(s->logits, x, w->wcls, p->dim, p->vocab_size);
return s->logits;

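2.3 Sampling methods

Once the logits are available, a sampling step decides which token to output. Common methods are greedy (argmax) sampling, random sampling, and top-p (nucleus) sampling.

2.3.1 Greedy Sampling

Greedy sampling simply picks the token with the highest probability. The code is simple and direct: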

int sample_argmax(float* probabilities, int n) {
    // return the index that has the highest probability
    int max_i = 0;
    float max_p = probabilities[0];
    for (int i = 1; i < n; i++) {
        if (probabilities[i] > max_p) {
            max_i = i;
            max_p = probabilities[i];
        }
    }
    return max_i;
}

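2.3.2 Random Sampling

Random sampling draws a token at random according to the probability distribution: token i is returned with probability probabilities[i]. The code walks the cumulative distribution until it exceeds the random number coin: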

int sample_mult(float* probabilities, int n, float coin) {
    // sample index from probabilities (they must sum to 1!)
    // coin is a random number in [0, 1), usually from random_f32()
    float cdf = 0.0f;
    for (int i = 0; i < n; i++) {
        cdf += probabilities[i];
        if (coin < cdf) {
            return i;
        }
    }
    return n - 1; // in case of rounding errors
}

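2.3.3 Top-p (Nucleus) Sampling

Top-p (nucleus) sampling draws a token at random from the smallest set of highest-probability tokens whose cumulative probability exceeds the threshold topp, so very unlikely tokens are never chosen. The code is as follows: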

int sample_topp(float* probabilities, int n, float topp, ProbIndex* probindex, float coin) {
    // top-p sampling (or "nucleus sampling") samples from the smallest set of
    // tokens that exceed probability topp. This way we never sample tokens that
    // have very low probabilities and are less likely to go "off the rails".
    // coin is a random number in [0, 1), usually from random_f32()

    int n0 = 0;
    // quicksort indices in descending order of probabilities
    // values smaller than (1 - topp) / (n - 1) cannot be part of the result
    // so for efficiency we crop these out as candidates before sorting
    const float cutoff = (1.0f - topp) / (n - 1);
    for (int i = 0; i < n; i++) {
        if (probabilities[i] >= cutoff) {
            probindex[n0].index = i;
            probindex[n0].prob = probabilities[i];
            n0++;
        }
    }
    qsort(probindex, n0, sizeof(ProbIndex), compare);

    // truncate the list where cumulative probability exceeds topp
    float cumulative_prob = 0.0f;
    int last_idx = n0 - 1; // in case of rounding errors consider all elements
    for (int i = 0; i < n0; i++) {
        cumulative_prob += probindex[i].prob;
        if (cumulative_prob > topp) {
            last_idx = i;
            break; // we've exceeded topp by including last_idx
        }
    }

    // sample from the truncated list
    float r = coin * cumulative_prob;
    float cdf = 0.0f;
    for (int i = 0; i <= last_idx; i++) {
        cdf += probindex[i].prob;
        if (r < cdf) {
            return probindex[i].index;
        }
    }
    return probindex[last_idx].index; // in case of rounding errors
}

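2.3.4 Choosing a sampling strategy

Before sampling, the logits go through a few transformations:

Divide by temperature to adjust the distribution; the higher the temperature, the flatter the distribution.
Compute softmax(logits) to obtain the probability distribution.

The code is as follows: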

// apply the temperature to the logits
for (int q=0; q<sampler->vocab_size; q++) { logits[q] /= sampler->temperature; }
// apply softmax to the logits to get the probabilities for next token
softmax(logits, sampler->vocab_size);

The appropriate sampling function is then called depending on the chosen strategy.

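2.4 encode and decode

2.4.1 encode

The encode function converts the input text into a sequence of token ids. Token ids are of type int, and the sequence has length max_len. The algorithm is straightforward. First, each UTF-8 character is looked up in the tokenizer vocabulary. If it is not found, its bytes are encoded with byte fallback. Note that a UTF-8 character is 1 to 4 bytes long, so the code has to follow the UTF-8 encoding rules to find character boundaries.

The code is as follows: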

// process the raw (UTF-8) byte sequence of the input string
for (char *c = text; *c != '\0'; c++) {

    // reset buffer if the current byte is ASCII or a leading byte
    // 0xC0 is 11000000, so (*c & 0xC0) keeps the first 2 bits and zeros the rest
    // 0x80 is 10000000
    // in UTF-8, all continuation bytes start with "10" in first two bits
    // so in English this is: "if this byte is not a continuation byte"
    if ((*c & 0xC0) != 0x80) {
        // this byte must be either a leading byte (11...) or an ASCII char (0x...)
        // => reset our location, as we're starting a new UTF-8 codepoint
        str_len = 0;
    }

    // append the current byte to the buffer
    str_buffer[str_len++] = *c; // ++ is post-increment, incremented after this line
    str_buffer[str_len] = '\0';

    // while the next character is a continuation byte, continue appending
    // but if there are too many of them, just stop to avoid overruning str_buffer size.
    if ((*(c+1) & 0xC0) == 0x80 && str_len < 4) {
        continue;
    }

    // ok c+1 is not a continuation byte, so we've read in a full codepoint
    int id = str_lookup(str_buffer, t->sorted_vocab, t->vocab_size);

    if (id != -1) {
        // we found this codepoint in vocab, add it as a token
        tokens[(*n_tokens)++] = id;
    } else {
        // byte_fallback encoding: just encode each byte as a token
        // +3 is here because the first 3 vocab elements are <unk>, <s>, </s>
        // so the individual bytes only start at index 3
        for (int i=0; i < str_len; i++) {
            tokens[(*n_tokens)++] = (unsigned char)str_buffer[i] + 3;
        }
    }
    str_len = 0; // protect against a sequence of stray UTF8 continuation bytes
}
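The bit tests used in the loop above follow directly from the UTF-8 byte layout. The helpers below are an illustrative sketch (the names are made up for this example and do not appear in llama2.c): they classify a byte by its leading bits and count code points by skipping continuation bytes, which is the same `(b & 0xC0) == 0x80` check the encoder uses.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with `lead`,
 * or 0 if `lead` is a continuation byte or not a valid leading byte. */
int utf8_seq_len(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* 10xxxxxx or invalid */
}

/* Continuation bytes have the top two bits equal to "10". */
int utf8_is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Count code points in a NUL-terminated string by counting every byte
 * that is not a continuation byte. Assumes well-formed UTF-8. */
int utf8_count_codepoints(const char *s) {
    assert(s != NULL);
    int count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if (!utf8_is_continuation(*p)) count++;
    }
    return count;
}
```

For example, the three-character string "a", "é" (bytes C3 A9), "中" (bytes E4 B8 AD) occupies six bytes, which is why the encoder buffers bytes until the next byte is no longer a continuation byte before looking the character up.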

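Second, the function tries to merge adjacent tokens. It concatenates the strings of each pair of neighboring tokens and looks the result up in the tokenizer vocabulary. In each iteration, among all pairs whose concatenation exists in the vocabulary, the pair with the highest score is merged into a single token. This repeats until no two adjacent tokens can be merged. The code is also straightforward: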

// merge the best consecutive pair each iteration, according the scores in vocab_scores
while (1) {
    float best_score = -1e10;
    int best_id = -1;
    int best_idx = -1;

    for (int i=0; i < (*n_tokens-1); i++) {
        // check if we can merge the pair (tokens[i], tokens[i+1])
        sprintf(str_buffer, "%s%s", t->vocab[tokens[i]], t->vocab[tokens[i+1]]);
        int id = str_lookup(str_buffer, t->sorted_vocab, t->vocab_size);
        if (id != -1 && t->vocab_scores[id] > best_score) {
            // this merge pair exists in vocab! record its score and position
            best_score = t->vocab_scores[id];
            best_id = id;
            best_idx = i;
        }
    }

    if (best_idx == -1) {
        break; // we couldn't find any more pairs to merge, so we're done
    }

    // merge the consecutive pair (best_idx, best_idx+1) into new token best_id
    tokens[best_idx] = best_id;
    // delete token at position best_idx+1, shift the entire sequence back 1
    for (int i = best_idx+1; i < (*n_tokens-1); i++) {
        tokens[i] = tokens[i+1];
    }
    (*n_tokens)--; // token length decreased
}

2.4.2 decode

The decode function converts a token id back into text. The code is straightforward; the few tricky spots are explained in the comments:

char* decode(Tokenizer* t, int prev_token, int token) {
    char *piece = t->vocab[token];
    // following BOS (1) token, sentencepiece decoder strips any leading whitespace (see PR #89)
    if (prev_token == 1 && piece[0] == ' ') { piece++; }
    // careful, some tokens designate raw bytes, and look like e.g. '<0x01>'
    // parse this and convert and return the actual byte
    unsigned char byte_val;
    if (sscanf(piece, "<0x%02hhX>", &byte_val) == 1) {
        piece = (char*)t->byte_pieces + byte_val * 2;
    }
    return piece;
}

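2.5 Text generation

Text generation is the most basic inference logic; chat is built on top of it. The overall logic is very simple:

Run the forward computation on the token ids one at a time.
Check whether the current position is still within the prompt. If it is not, apply the sampling strategy to choose the next token from the logits.
Otherwise, read the next token directly from the prompt.
Decode the next token and print it.

The code is as follows: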

while (pos < steps) {

    // forward the transformer to get logits for the next token
    float* logits = forward(transformer, token, pos);

    // advance the state machine
    if (pos < num_prompt_tokens - 1) {
        // if we are still processing the input prompt, force the next prompt token
        next = prompt_tokens[pos + 1];
    } else {
        // otherwise sample the next token from the logits
        next = sample(sampler, logits);
    }
    pos++;

    // data-dependent terminating condition: the BOS (=1) token delimits sequences
    if (next == 1) { break; }

    // print the token as string, decode it with the Tokenizer object
    char* piece = decode(tokenizer, token, next);
    safe_printf(piece); // same as printf("%s", piece), but skips "unsafe" bytes
    fflush(stdout);
    token = next;

    // init the timer here because the first iteration can be slower
    if (start == 0) { start = time_in_ms(); }
}

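2.6 Everything else

The rest of the code consists of simple data-structure definitions, helper functions, and the main function, which are not covered here.

3. Summary

Overall, this is a toy project with fairly simple logic, yet it offers a great deal of detail to learn from. In particular, it supports both MHA and GQA, which is very helpful for understanding how these algorithms work.

It should also be noted that the code does not implement a separate prefill stage. Instead, it fills the KV cache by feeding the prompt one token at a time. This is indeed inefficient, but the logic is clear and easy to follow.

There is plenty of room to optimize the code further, for example by parallelizing prefill or avoiding repeated decoding. These are beyond the scope of the project and are left for the reader to explore.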
