1. Overview
A minimal GPT implemented by following Andrej Karpathy's tutorial; it can continue a piece of text (sample output at the end).
2. Jupyter Notebook
In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
1. Hyperparameter Settings¶
In [2]:
# Hyperparameters; feel free to come back to these once they are actually used
batch_size = 64  # how many sequences are processed in parallel during training
block_size = 256  # maximum context length used for each prediction
max_iters = 5000  # number of training iterations
eval_interval = 500  # estimate the loss once every this many iterations
learning_rate = 3e-4  # learning rate; Transformers do not tolerate a learning rate that is too high
device = (
    "cuda" if torch.cuda.is_available() else "cpu"
)  # a CUDA device is strongly recommended; on my 3070 Ti this model size trains in roughly 20 minutes
eval_iters = 200  # how many batches to average over for each loss estimate
n_embd = 384  # total size of the vector each attention layer produces for the context (split evenly across the heads)
n_head = 6  # number of attention heads
n_layer = 6  # number of Transformer (attention) layers
dropout = 0.2  # dropout probability
# ------------
torch.manual_seed(1337)
Out[2]:
<torch._C.Generator at 0x263a6eddc30>
2. Preparing the Training Data¶
2.1 Corpus Preparation¶
I used the first four Harry Potter books; download links:
- https://github.com/amephraim/nlp/blob/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer%27s%20Stone.txt
- https://github.com/amephraim/nlp/blob/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%202%20-%20The%20Chamber%20Of%20Secrets.txt
- https://github.com/amephraim/nlp/blob/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%203%20-%20Prisoner%20of%20Azkaban.txt
- https://github.com/amephraim/nlp/blob/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%204%20-%20The%20Goblet%20of%20Fire.txt
Andrej's tutorial uses the complete works of Shakespeare; there is no essential difference, so pick whichever corpus you like.
Download the files into the same directory as the notebook and rename them ('Harry Potter 1.txt', 'Harry Potter 2.txt', 'Harry Potter 3.txt', 'Harry Potter 4.txt').
In [3]:
# Load the corpus
text = ""
for the_file_path in [
    "Harry Potter 1.txt",
    "Harry Potter 2.txt",
    "Harry Potter 3.txt",
    "Harry Potter 4.txt",
]:
    with open(the_file_path, "r", encoding="latin-1") as f:
        tmp_text = f.read()
    text = text + tmp_text
In [27]:
# Inspecting the Harry Potter text from GitHub shows many unnecessary line breaks; remove them so they do not
# mislead the model during training, but keep the double line breaks (paragraph boundaries), protected via a placeholder.
placeholder = "##DOUBLE_NEWLINE##"
text = text.replace("\n\n", placeholder)
# Remove all remaining single '\n'
text = text.replace("\n", "")
# Restore the paired '\n\n' from the placeholder
text = text.replace(placeholder, "\n\n")
In [28]:
# Inspect the first 1000 characters
text[:1000]
Out[28]:
"Harry Potter and the Sorcerer's Stone\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to saythat they were perfectly normal, thank you very much. They were the lastpeople you'd expect to be involved in anything strange or mysterious,because they just didn't hold with such nonsense.\n\nMr. Dursley was the director of a firm called Grunnings, which madedrills. He was a big, beefy man with hardly any neck, although he didhave a very large mustache. Mrs. Dursley was thin and blonde and hadnearly twice the usual amount of neck, which came in very useful as shespent so much of her time craning over garden fences, spying on theneighbors. The Dursleys had a small son called Dudley and in theiropinion there was no finer boy anywhere.\n\nThe Dursleys had everything they wanted, but they also had a secret, andtheir greatest fear was that somebody would discover it. They didn'tthink they could bear it if anyone found out about the Potters. Mrs.Potter was"
In [31]:
# 检查一下后1000字符
text[-1000:]
Out[31]:
'do me one favor, okay? Buy Ron some different dress robes and say they\'re from you."He left the compartment before they could say another word, stepping over Malfoy, Crabbe, and Goyle, who were still lying on the floor, covered in hex marks.\n\nUncle Vernon was waiting beyond the barrier. Mrs. Weasley was close by him. She hugged Harry very tightly when she saw him and whispered in his ear, "I think Dumbledore will let you come to us later in the summer. Keep in touch, Harry.""See you. Harry," said Ron, clapping him on the back."\'Bye, Harry!" said Hermione, and she did something she had never done before, and kissed him on the cheek."Harry - thanks," George muttered, while Fred nodded fervently at his side.Harry winked at them, turned to Uncle Vernon, and followed him silently from the station. There was no point worrying yet, he told himself, as he got into the back of the Dursleys\' car.As Hagrid had said, what would come, would come ... and he would have to meet it when it did.\n\n'
2.2 Corpus Encoding¶
Map each individual character to an int in the simplest possible way.
Note: ChatGPT uses a dedicated subword-level tokenization (which requires training a separate model); we will leave that for later.
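For comparison, the sketch below shows what such a subword (BPE) encoding looks like. It is only an illustration, under the assumption that the third-party tiktoken package is installed; it is not used anywhere else in this notebook.
In [ ]:
# Hedged illustration only (assumes the tiktoken package is available);
# this notebook itself sticks to the character-level encoding defined below.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE subword vocabulary (~50k tokens)
ids = enc.encode("Harry Potter and the Sorcerer's Stone")
print(ids)              # a short list of integer token ids, far fewer than the number of characters
print(enc.decode(ids))  # round-trips back to the original string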
In [5]:
# Collect every character that appears in the corpus
chars = sorted(list(set(text)))
print("".join(chars))
!"$%&'()*,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ\]^_`abcdefghijklmnopqrstuvwxyz}~ü
In [6]:
vocab_size = len(chars)
# Build the character-to-int and int-to-character mappings directly
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
# Define the encode and decode functions
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])
In [32]:
# Quick test
encode("hello")
Out[32]:
[68, 65, 72, 72, 75]
In [33]:
# Quick test
decode([68, 65, 72, 72, 75])
Out[33]:
'hello'
2.3 Train/Validation Split¶
In [7]:
data = torch.tensor(encode(text), dtype=torch.long)
## Split: 90% for training, 10% for validation
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
2.4 Random Batch Sampling¶
A helper for fetching data during stochastic-gradient-descent training and during evaluation.
In [ ]:
def get_batch(split):
    # Pick the training set or the validation set
    data = train_data if split == "train" else val_data
    # Randomly choose batch_size starting indices
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # Slice out batch_size sequences of length block_size from those starting positions
    x = torch.stack([data[i : i + block_size] for i in ix])
    # The targets y are the same sequences shifted one position to the right
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    # Move the data to the CPU or GPU
    x, y = x.to(device), y.to(device)
    return x, y
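A quick, hedged sanity check of the batch shapes (it only calls the get_batch function defined above):
In [ ]:
# Hedged sanity check: draw one batch and confirm the shapes and the one-position shift.
xb, yb = get_batch("train")
print(xb.shape, yb.shape)                  # expected: torch.Size([64, 256]) for both (batch_size, block_size)
print(torch.equal(xb[0, 1:], yb[0, :-1]))  # True: the targets are the inputs shifted left by one position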
3. Building the Minimal GPT Model¶
3.1 Loss Estimation Function¶
A loss-estimation function used to periodically monitor how training is going.
Note 1: because it is based on sampling, it only estimates the loss; it is not the exact loss over the full training/validation sets.
Note 2: since it is only used for estimating the loss, there is no point wasting compute on gradients, so we add @torch.no_grad().
In [ ]:
@torch.no_grad()
def estimate_loss():
    out = {}
    # Switch the model to evaluation mode. This keeps the model's behaviour consistent with training
    # while removing training-only randomness, making the evaluation more stable and repeatable.
    # In training mode, layers such as Dropout and Batch Normalization behave differently.
    model.eval()
    for split in ["train", "val"]:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            # Run one forward pass; calling model(...) invokes the model's forward() method
            logits, loss = model(X, Y)
            # The tensor returned by F.cross_entropy() carries extra information and operations;
            # loss.item() extracts the plain cross-entropy value
            losses[k] = loss.item()
        # Average over many samples to make the estimate more accurate
        out[split] = losses.mean()
    # Switch back to training mode
    model.train()
    return out
3.2 Single-Head Attention¶
In [ ]:
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        # Each head has a key matrix that is applied to the input to extract its feature information
        self.key = nn.Linear(n_embd, head_size, bias=False)
        # Each head has a query matrix that is applied to the input to extract the "question" it wants to ask
        self.query = nn.Linear(n_embd, head_size, bias=False)
        # An additional value matrix is applied to the input to obtain the information it is willing to provide
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # A lower-triangular matrix used later to mask the attention weights
        # (so that each position only attends to earlier tokens).
        # It is not a model weight and takes no part in training, so it is registered as a buffer
        # to keep it fixed during training.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        # Following "Dropout: A Simple Way to Prevent Neural Networks from Overfitting",
        # randomly dropping some units during training helps prevent overfitting
        # and keeps the model from relying too heavily on a few units.
        # PyTorch disables dropout at inference time; to keep the expected weighted sum consistent
        # between training and inference, the surviving values are rescaled by the dropout rate.
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # One forward pass
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B, T, C = x.shape
        # Apply the key matrix to x to extract its feature information
        k = self.key(x)  # (B,T,hs)
        # Apply the query matrix to x to extract the "question" each position asks
        q = self.query(x)  # (B,T,hs)
        # compute attention scores ("affinities")
        # Take the dot product of q and k to measure how well each key matches each query
        # (the better the match, the more attention it deserves).
        # To compute the dot products, transpose the last two dims: (B,T,hs) @ (B,hs,T) --> (B,T,T)
        # Note: ignoring B, k and q are both matrices of T row vectors, so there are T*T dot products and T*T scores.
        # wei measures how much each position cares about the information at every other position.
        wei = (
            q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        )  # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # Each position may only attend to earlier tokens; contributions from later positions
        # are zeroed out (softmax(-inf) == 0)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        # Randomly drop some attention weights during the forward pass as well
        # Note: PyTorch automatically disables Dropout when model.eval() is used
        wei = self.dropout(wei)
        # Compute the information each position is willing to provide
        v = self.value(x)  # (B,T,hs)
        # Multiply the attention weights by the values of the earlier positions
        # to obtain the information each position attends to
        out = wei @ v  # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out
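Before wiring the head into the full model, here is a small hedged sketch (random input, not part of the training pipeline) that checks the output shape and shows the causal mask pattern:
In [ ]:
# Hedged sketch: run a single head on random input and look at the causal mask.
_head = Head(head_size=n_embd // n_head)  # 384 // 6 = 64
_x = torch.randn(2, 8, n_embd)            # (B=2, T=8, C=n_embd); T only needs to be <= block_size
print(_head(_x).shape)                    # expected: torch.Size([2, 8, 64])
print(torch.tril(torch.ones(4, 4)))       # the lower-triangular mask: 1 = may attend, 0 = masked out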
3.3 Multi-Head Attention¶
Apply several single attention heads in parallel, then let the heads exchange the information they attended to (via the proj linear layer).
In [ ]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        # Apply several single heads in parallel
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # Following "Attention Is All You Need", after the attention computation a linear layer
        # is needed to exchange information between the heads
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        # Again, randomly drop some units during training
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Each head computes its own attention result (with no information exchange between heads),
        # and the results are concatenated together
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # The proj matrix lets the per-head results communicate with each other
        out = self.dropout(self.proj(out))
        return out
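The same kind of hedged shape check for the multi-head version: six heads of size 64 are concatenated and projected back to the 384-dimensional embedding.
In [ ]:
# Hedged sketch: the concatenated heads are projected back to the embedding dimension.
_mha = MultiHeadAttention(num_heads=n_head, head_size=n_embd // n_head)
print(_mha(torch.randn(2, 8, n_embd)).shape)  # expected: torch.Size([2, 8, 384])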
3.4 Feed-Forward Network¶
A nonlinear layer is needed (otherwise the model reduces to a single matrix, and it also helps with vanishing gradients).
The approach: expand the feature space to extract more information, apply ReLU, then map back to the original feature space (keeping the feature dimension stable makes layers easy to stack).
In [ ]:
class FeedFoward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            # Project up into a higher-dimensional feature space
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            # Project back down
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
3.5 The Full Transformer Block¶
It contains the following components:
- Multi-head self-attention (Multi-Head Attention): extracts the relationships within the input and captures global information.
- Residual connection and layer normalization (Residual Connection + Layer Normalization): the input is layer-normalized, passed through the attention layer, and added back to the original input. The residual connection helps gradients flow and avoids vanishing gradients.
- Feed-forward network (Feed Forward Network, FFN): two linear layers with a ReLU activation that apply a position-wise nonlinear transformation, increasing the model's nonlinear capacity.
- Residual connection and layer normalization (Residual Connection + Layer Normalization): the output of the attention sub-block is layer-normalized, passed through the feed-forward network, and added back, forming the second residual connection.
In [ ]:
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        # Each head gets an equal share of the embedding dimension,
        # so the concatenated heads are exactly the embedding size
        head_size = n_embd // n_head
        # The multi-head attention defined above
        self.sa = MultiHeadAttention(n_head, head_size)
        # The feed-forward network defined above
        self.ffwd = FeedFoward(n_embd)
        # Layer norm: normalizes, then applies a learnable per-element scale and bias
        self.ln1 = nn.LayerNorm(n_embd)
        # Layer norm: normalizes, then applies a learnable per-element scale and bias
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Adding the input straight back onto the output keeps information flowing smoothly,
        # mitigates vanishing/exploding gradients, and improves training efficiency.
        # Think of it as keeping a "highway" open so gradients can propagate more easily.
        # See the paper "Deep Residual Learning for Image Recognition".
        # Because of the addition, the feature scale would otherwise keep growing, hence the normalization.
        # First layer-normalize the input x, pass it through multi-head self-attention (sa),
        # then add the result to x to form the residual connection
        x = x + self.sa(self.ln1(x))
        # Layer-normalize again, pass through the feed-forward network (ffwd),
        # and add back to form the second residual connection
        x = x + self.ffwd(self.ln2(x))
        return x
3.6 The Complete Minimal GPT Model¶
In [ ]:
class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Token embedding layer: maps each token in the vocabulary to an embedding vector (via a lookup table)
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # Position embedding layer: represents each token's position within the sequence
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # A stack of Transformer blocks
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head=n_head) for _ in range(n_layer)]
        )
        # Final layer normalization (with learnable scale and bias)
        self.ln_f = nn.LayerNorm(n_embd)
        # Project back to the vocabulary
        self.lm_head = nn.Linear(n_embd, vocab_size)
        # Better weight initialization. This is not covered in Andrej's original GPT video,
        # but it matters; Andrej covers it in later videos.
        self.apply(self._init_weights)
        # This initialization scheme comes from practical experience; what it does and why:
        # 1. Prevent vanishing/exploding gradients: drawing weights from a normal distribution
        #    (mean 0, std 0.02) keeps the initial weights from being too large or too small.
        # 2. Faster convergence: a standard deviation of 0.02 is an empirically good choice,
        #    especially for Transformer models, and helps the model find a path to convergence sooner.
        # 3. Zero biases: initializing biases to zero (torch.nn.init.zeros_()) is simple and effective,
        #    ensuring that initially no neuron's output is favored over another.

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    # Forward pass
    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both integer tensors of shape (B, T)
        # (B, T, C): token embeddings
        tok_emb = self.token_embedding_table(idx)
        # (T, C): position embeddings
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # (B, T, C): add the token and position embeddings so x carries both token identity and position information
        # Note: addition works better here than concatenation:
        # 1. Addition does not change the dimensionality.
        # 2. Concatenation is effectively a special case of matrix-multiply-then-add,
        #    so multiply-then-add is the more flexible operation.
        x = tok_emb + pos_emb
        # (B, T, C): pass through the stack of Transformer blocks
        x = self.blocks(x)
        # (B, T, C): final layer normalization
        x = self.ln_f(x)
        # (B, T, vocab_size): linear layer producing the logits for the next token
        logits = self.lm_head(x)
        # Distinguish training from pure inference; pure inference does not need the loss
        if targets is None:
            loss = None
        else:
            # Read off the three dimensions again: batch_size * block_size * vocab_size
            B, T, C = logits.shape
            # Flatten the first two dimensions into one
            logits = logits.view(B * T, C)
            # Flatten the targets as well
            targets = targets.view(B * T)
            # Cross-entropy measures the gap between the predicted distribution and the actual targets
            # (cross_entropy accepts exactly this pair of inputs: a distribution over classes and target indices)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    # Generate the next tokens
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        # Generate max_new_tokens tokens, one at a time
        for _ in range(max_new_tokens):
            # Crop idx to the last block_size tokens (the position embedding table only covers block_size positions)
            idx_cond = idx[:, -block_size:]
            # Get the predictions for every position; logits is a batch_size * sequence_length * vocab_size tensor
            logits, loss = self(idx_cond)
            # Discard everything except the last position
            logits = logits[:, -1, :]  # becomes (B, C)
            # Apply softmax (maps values into [0, 1], summing to 1) to get the probabilities of each possible
            # next token; dim=-1 applies it along the last dimension
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # For each row, sample an index according to these probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append the sampled token to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx
4. Training the Model¶
In [8]:
model = GPTLanguageModel()
m = model.to(device)
# Compute and print the total number of parameters
print(sum(p.numel() for p in m.parameters()) / 1e6, "M parameters")
# Create the optimizer; AdamW adapts the update size for each parameter, which improves convergence
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    # At every evaluation interval, estimate the loss to monitor convergence
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(
            f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}"
        )
    # Fetch a batch of training data
    xb, yb = get_batch("train")
    # Compute the loss
    logits, loss = model(xb, yb)
    # Reset the gradients to zero; otherwise they accumulate, since PyTorch accumulates gradients
    # by default (useful, e.g., for manually accumulating gradients over several sub-batches)
    optimizer.zero_grad(set_to_none=True)
    # PyTorch's built-in backpropagation computes the gradient for every parameter
    loss.backward()
    # The optimizer updates each parameter based on its gradient w.r.t. the loss;
    # the actual step size is determined jointly by the gradient and the optimization algorithm (here AdamW)
    optimizer.step()
10.808923 M parameters
step 0: train loss 4.5501, val loss 4.5495
step 500: train loss 1.7859, val loss 1.7475
step 1000: train loss 1.3751, val loss 1.3425
step 1500: train loss 1.2467, val loss 1.2330
step 2000: train loss 1.1772, val loss 1.1875
step 2500: train loss 1.1267, val loss 1.1483
step 3000: train loss 1.0933, val loss 1.1293
step 3500: train loss 1.0616, val loss 1.1137
step 4000: train loss 1.0352, val loss 1.1044
step 4500: train loss 1.0091, val loss 1.0926
step 4999: train loss 0.9891, val loss 1.0874
5. Saving and Loading the Model Parameters¶
In [9]:
# Save the parameters
torch.save(model.state_dict(), "model_parameters_harrypotter.pth")
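If you also want to resume training later (rather than only run inference), you can save the optimizer state alongside the model. This is an optional extension sketched with a hypothetical checkpoint filename; the rest of the notebook only uses the state dict saved above.
In [ ]:
# Hedged optional extension: save a full checkpoint (model + optimizer) so training can be resumed later.
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    },
    "checkpoint_harrypotter.pth",  # hypothetical filename, not used elsewhere in this notebook
)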
In [10]:
# Rebuild the model
model = GPTLanguageModel()
m = model.to(device)
# Load the saved parameters
m.load_state_dict(torch.load("model_parameters_harrypotter.pth", weights_only=True))
Out[10]:
<All keys matched successfully>
6. Generating Text with the GPT Model¶
In [21]:
print(encode(["O"]))
[44]
In [41]:
# Try generating some text
# Start the context with 'I'
context = torch.zeros((1, 1), dtype=torch.long, device=device)
context[0][0] = encode(["I"])[0]
# context[0][1] = 37
# context[0][2] = 44
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
Iweared as they stretch of him. "Sonty, Ronad Howler, you dident them into there?" and Her conduction. He looked forward. Her hand at the other. "So what d'you think here?" Black opened the steering back onto the chocolate. "That's been known of wizards, next to his feet,Harry Potter, Siriuta." "You shut are true," said Harry. Dudley had to go all so muchtember, but coursing picking up inthe last years to be opponing. "I'm stuggling it," he said importantly, softrying a fewer squashy and still gruly smarking. "What d'you don't taught," Harry asked, "It's not far 'not, jus' better, it's the bethat off' the powers. Ot it -- us you could tall kill yeh!Reluch? Got I'd find tonight on to join yousit *250* espectacular outside corridors. "But Dumbledore's master!" said "The Colin Curesures about the Hall and jinxed." The Slythering worder lit down the air, dennig stupidly through the shelveshudder through the gave Harry holding him a secret and undolughthe tip. Speed highly in the sunken air kneed his wand. There was a dragon, ugly, Harry kimal jumped tables, and lightly pulled off the door of the doors behind the floor. Fred and George were still completely. "You don't stop?" she said to Harry. "Always don't know what we might do," he said, gestrucingly toward Uncle Vernon's, who didn't have to gamekeep was. "Nothing to pockedme .... last thing...." "Whenever I set them of ter several great emersed," said Black Dobby amid in a made turningProfessor McGonagall. "He's entered ter exasperate." "Which told me," said Harry. "I could usu my going find very redient something's life...." "When Diagon Magicall and both, of Octetion, dear, but Harry had used to believe your last defaint youback -- of course,as you see. Flee,you're" "Way s this of us?" said Harry. "You couldn't use what I did... " "Will show -- he's got to there... this better the last of house --" Harry strode out of her robes. Both dog from himself behind him. Through wanting from the charge,
7. Remarks¶
The generated text still doesn't mean much, but it at least looks the part, and it is a huge improvement over the earlier Bigram model.
Swapping the current character-level encode/decode for the subword-level tokenization ChatGPT uses should make it noticeably better again.
To get genuinely useful output, though, the model size and the amount of training data would probably need to be scaled up.