1. Overview
The complete path so far:
- First, build a simple GPT model, see: https://pangruitao.com/post/4920
- Then, study tokenization and implement a Tokenizer based directly on the BPE algorithm: https://pangruitao.com/post/4960
- After training the Tokenizer, save the result to a file for the GPT model to use later
- Modify the GPT model's vocabulary size and pre/post-processing so it uses the previously trained Tokenizer: this post
2. Jupyter Notebook
In [1]:
import pickle
import torch
import torch.nn as nn
from torch.nn import functional as F
1. Hyperparameter Settings¶
In [2]:
# Hyperparameters; you can come back to these once they show up later
batch_size = 64  # how many sequences are processed in parallel during training
block_size = 256  # how much context to use for each prediction
# max_iters = 5000  # number of training iterations
eval_interval = 200  # evaluate the loss every this many iterations
learning_rate = 3e-4  # learning rate; it should not be too high for a transformer
device = (
    "cuda" if torch.cuda.is_available() else "cpu"
)  # a CUDA device is strongly recommended; at this parameter scale training takes about 20 minutes on my 3070 Ti
eval_iters = 200  # how many batches to use per loss evaluation
n_embd = 384  # total size of the per-token vector produced by each attention layer (split evenly across the heads)
n_head = 6  # number of attention heads
n_layer = 6  # number of attention layers (Transformer blocks)
dropout = 0.2  # dropout rate
# ------------
torch.manual_seed(1337)
Out[2]:
<torch._C.Generator at 0x1ed06609c30>
2. Preparing the Training Data¶
2.1 Corpus Preparation¶
I use Harry Potter books 1-7 here. Download link:
Andrej's tutorial uses the complete works of Shakespeare; there is no essential difference, so pick whichever you prefer.
Download the file into the same directory as the notebook and rename it ('Harry Potter 1-7.txt').
In [3]:
# Load the corpus
# Note: "ansi" is a Windows-only encoding alias; use whatever encoding matches your text file
text = ""
the_file_path = "Harry Potter 1-7.txt"
with open(the_file_path, "r", encoding="ansi") as f:
    text = f.read()
In [4]:
# Check the first 1000 characters
text[:1000]
Out[4]:
"1.Harry Potter and the Sorcerer's Stone.txt\n\n\u3000\u3000Harry Potter and the Sorcerer's Stone\n\u3000\u3000CHAPTER ONE\n\u3000\u3000THE BOY WHO LIVED\n\u3000\u3000Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.\n\u3000\u3000Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.\n\u3000\u3000The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they coul"
In [5]:
# The line breaks in the text are inconsistent: some places use a single newline and others use two. Normalize everything to double newlines.
placeholder = "##DOUBLE_NEWLINE##"
text = text.replace("\n\n", placeholder)
# Replace every remaining single '\n' with '\n\n'
text = text.replace("\n", "\n\n")
# Restore the pairs of '\n\n' that were protected by the placeholder
text = text.replace(placeholder, "\n\n")
# Also remove the full-width spaces in the text
text = text.replace("\u3000", "")
In [6]:
len(text)
Out[6]:
6360384
In [7]:
# Check the first 1000 characters again
text[:1000]
Out[7]:
"1.Harry Potter and the Sorcerer's Stone.txt\n\nHarry Potter and the Sorcerer's Stone\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.\n\nMr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.\n\nThe Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear "
In [8]:
# # Check the last 1000 characters
# text[-1000:]
2.2 Tokenization¶
Encode the text (and later decode it) with the BPE-based Tokenizer we trained previously.
In [9]:
# Load the trained tokenizer from file
with open("bpe_data.pkl", "rb") as file:
    loaded_data = pickle.load(file)
merges = loaded_data["merges"]
vocab = loaded_data["vocab"]
In [10]:
# merges
In [11]:
# vocab
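For reference, the loaded objects are expected to have the following structure (the concrete values below are made up for illustration): merges maps a pair of token ids to the new token id assigned when that pair was merged during training, and vocab maps every token id back to its raw bytes.
# merges: {(101, 32): 256, (116, 104): 257, ...}   # (token_id, token_id) -> new token id, in training order
# vocab:  {101: b'e', 116: b't', ..., 256: b'e ', 257: b'th', ...}  # token_id -> raw bytes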
In [12]:
def get_stats(ids, counts=None):
    """
    Count how often each consecutive pair occurs in ids
    Example: [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    """
    counts = {} if counts is None else counts
    for pair in zip(ids, ids[1:]):  # iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts
In [13]:
# Merge (replacement) function
def merge(ids, pair, idx):
    """
    Replace every occurrence of pair in the ids sequence with idx
    Example: ids=[1, 2, 3, 1, 2], pair=(1, 2), idx=4 -> [4, 3, 4]
    """
    newids = []
    i = 0
    while i < len(ids):
        # if not at the very last position AND the pair matches, replace it
        if ids[i] == pair[0] and i < len(ids) - 1 and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
In [14]:
# From raw text to token ids
def encode(original_text, merges):
    # Convert the raw characters to token ids via their UTF-8 bytes
    text_bytes = original_text.encode("utf-8")  # raw bytes
    ids = list(text_bytes)  # list of integers in range 0..255
    while len(ids) >= 2:
        # Apply merges in the same order in which they were learned during training
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        # Pairs not in merges get rank inf (lowest merge priority).
        # But if every pair is inf, min simply returns the first one, so we still have to check:
        if pair not in merges:
            break  # nothing left to merge
        # Merge the pair that was learned earliest
        idx = merges[pair]
        ids = merge(ids, pair, idx)
        # Progress monitoring
        if idx % 10 == 0:
            print(f"done with id {idx}")
    return ids
In [15]:
# From token ids back to raw text
def decode(ids, vocab):
    # given ids (list of integers), return Python string
    text_bytes = b"".join(vocab[idx] for idx in ids)
    # If decoding fails (e.g. in UTF-8 a leading byte can never be 0x80), substitute a replacement character
    text = text_bytes.decode("utf-8", errors="replace")
    return text
In [16]:
vocab_size = len(vocab)
vocab_size
Out[16]:
360
In [17]:
# Try it out
encode("hello", merges)
done with id 280
Out[17]:
[352, 280, 111]
In [18]:
# Try it out
decode([335, 277, 111], vocab)
Out[18]:
'edeno'
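A quick round-trip sanity check (a minimal sketch): since BPE merging only groups the original UTF-8 bytes, decoding the encoded ids should reproduce the input text exactly.
sample = "Harry Potter"
assert decode(encode(sample, merges), vocab) == sample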
2.3 Train/Validation Split¶
In [19]:
# Encode the full text first
data = torch.tensor(encode(text, merges), dtype=torch.long)
done with id 260
done with id 270
done with id 280
done with id 290
done with id 300
done with id 310
done with id 320
done with id 330
done with id 340
done with id 350
In [20]:
## Split the data: 90% for training, 10% for validation
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
2.4 Random Batch Function¶
A helper for grabbing data during stochastic-gradient-descent training and during validation
In [21]:
def get_batch(split):
    # Select the training or validation split
    data = train_data if split == "train" else val_data
    # Sample batch_size random starting indices
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # Slice out batch_size sequences of length block_size starting at those indices
    x = torch.stack([data[i : i + block_size] for i in ix])
    # Shift everything one position to the right to form the targets y
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    # Move the data to the CPU or GPU
    x, y = x.to(device), y.to(device)
    return x, y
3. Building the Simple GPT Model¶
3.1 Loss Estimation Function¶
A loss-estimation function used to periodically monitor how training is going
Note 1: because it works on sampled batches, this is only an estimate, not the exact loss over the full training/validation sets
Note 2: since it is only used for loss estimation, there is no need to waste compute on gradients, so we add @torch.no_grad()
In [22]:
@torch.no_grad()
def estimate_loss():
    out = {}
    # Switch the model to evaluation mode. This keeps the model's behaviour during validation/testing consistent with training,
    # but removes training-specific randomness, making the evaluation more stable and repeatable.
    # In training mode some layers, such as Dropout and Batch Normalization, keep modifying the model's behaviour on the fly.
    model.eval()
    for split in ["train", "val"]:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            # Run one prediction; calling model(...) invokes the model's forward()
            logits, loss = model(X, Y)
            # The object returned via F.cross_entropy() carries extra autograd information; use loss.item() to get the plain cross-entropy value
            losses[k] = loss.item()
        # Average over several sampled batches to make the estimate more accurate
        out[split] = losses.mean()
    # Back to training mode
    model.train()
    return out
3.2 Single-Head Attention¶
In [23]:
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        # Each head has a key matrix, applied to the input to extract the input's feature information
        self.key = nn.Linear(n_embd, head_size, bias=False)
        # Each head has a query matrix, applied to the input to extract the "question" being asked
        self.query = nn.Linear(n_embd, head_size, bias=False)
        # Plus a value matrix applied directly to the input, giving the information the input is willing to provide
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # A lower-triangular matrix used later to mask the attention weights (so each position only attends to earlier tokens).
        # Since it is not a model weight and is not trained, register it as a buffer so it does not change during training.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        # Following <Dropout: A Simple Way to Prevent Neural Networks from Overfitting>:
        # randomly dropping some units during training helps avoid overfitting and keeps the model from relying too heavily on a few units.
        # torch disables dropout layers at inference time.
        # To keep the expected weighted sum consistent between training and inference, dropout also rescales the result by the dropout rate.
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # One forward pass
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B, T, C = x.shape
        # Apply the key matrix to x to extract x's feature information
        k = self.key(x)  # (B,T,hs)
        # Apply the query matrix to x to extract the "question" being asked
        q = self.query(x)  # (B,T,hs)
        # compute attention scores ("affinities")
        # Take the dot product of q and k to measure how well each key matches each query; the better the match, the more attention it deserves.
        # To compute the dot products, transpose the last two dimensions: (B,T,hs) @ (B,hs,T) --> (B,T,T)
        # Note: ignoring B, k and q are each matrices of T row vectors, so there are T*T dot products and T*T results.
        # This wei expresses how much each position cares about the information at every other position.
        wei = (
            q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        )  # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # Make each position attend only to earlier tokens; contributions from later positions are zeroed out (softmax(-inf) == 0)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        # Also randomly drop some entries during the forward pass
        # Note: when model.eval() is used, PyTorch disables Dropout automatically
        wei = self.dropout(wei)
        # Compute the information each position is willing to provide
        v = self.value(x)  # (B,T,hs)
        # Multiply the attention weights by the information the earlier positions provide, giving the information each position attends to
        out = wei @ v  # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out
3.3 Multi-Head Attention¶
Apply several single attention heads in parallel, then let the heads exchange what they attended to (via the proj linear layer)
In [24]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        # Apply several single heads in parallel
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # Following <Attention Is All You Need>, after the attention step a linear layer is needed so the heads can exchange information
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        # Again randomly drop some units during training
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Each head computes its own attention result (with no information exchange between heads); concatenate what the heads attended to
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # Mix the per-head results through the proj matrix
        out = self.dropout(self.proj(out))
        return out
3.4 Feed-Forward Network¶
A non-linear layer is needed (otherwise the model collapses into a single matrix, and it also helps with vanishing gradients)
The approach: expand to a larger feature space to extract more information, apply ReLU, then map back to the original feature space (keeping the feature dimension stable makes layers easy to stack)
In [25]:
class FeedFoward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            # Map to a higher-dimensional feature space
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            # Map back down
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
3.5 The Full Transformer Block¶
It contains the following components:
- Multi-Head Attention: extracts relationships from the input and captures global information.
- Residual Connection + Layer Normalization: the multi-head attention output is added to its input and then normalized. The residual connection helps gradients flow and avoids vanishing gradients.
- Feed-Forward Network (FFN): two linear layers with a ReLU activation, applying a non-linear transformation to the features at each position. Its role is to increase the model's non-linear capacity.
- Residual Connection + Layer Normalization: the feed-forward output is added to the multi-head attention output, then normalized again.
In [26]:
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        # Each head gets an equal share of the embedding dimension, so the concatenation is exactly the embedding size
        head_size = n_embd // n_head
        # The multi-head attention defined above
        self.sa = MultiHeadAttention(n_head, head_size)
        # The feed-forward network defined above
        self.ffwd = FeedFoward(n_embd)
        # Layer norm: normalizes, then applies a learnable per-element scale and bias
        self.ln1 = nn.LayerNorm(n_embd)
        # Layer norm: normalizes, then applies a learnable per-element scale and bias
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # The input is added straight back onto the output to keep information flowing smoothly,
        # reduce vanishing/exploding gradients, and make training more efficient.
        # Think of it as keeping a "highway" along which gradients can propagate well.
        # See the paper <Deep Residual Learning for Image Recognition>.
        # Because of the addition, normalization is needed to keep the feature scale from growing.
        # First layer-normalize the input x, pass it through multi-head self-attention (sa), then add it back to x, forming the residual connection
        x = x + self.sa(self.ln1(x))
        # Layer-normalize again, pass through the feed-forward network (ffwd), and add back, forming the second residual connection
        x = x + self.ffwd(self.ln2(x))
        return x
3.6 The Complete Simple GPT Model¶
In [27]:
class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Each token looks up the logits for the next token directly from a table
        # Token embedding layer: maps every token in the vocabulary to an embedding vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # Position embedding layer: represents each token's position in the sequence
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # A stack of Transformer blocks
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head=n_head) for _ in range(n_layer)]
        )
        # Final layer normalization and scaling
        self.ln_f = nn.LayerNorm(n_embd)
        # Project back to the vocabulary
        self.lm_head = nn.Linear(n_embd, vocab_size)
        # Better weight initialization. This is not covered in Andrej's original GPT video but is important; Andrej covers it in a later video.
        self.apply(self._init_weights)

    # According to ChatGPT's explanation, this initialization scheme comes from practical experience; what it does:
    # 1. Prevents vanishing/exploding gradients: drawing weights from a normal distribution (mean 0, std 0.02) keeps the initial weights from being too large or too small.
    # 2. Faster convergence: std 0.02 is an empirically good choice, especially for Transformer models, and helps the model find a convergence path sooner.
    # 3. Zero biases: initializing biases to zero (torch.nn.init.zeros_()) is simple and effective; every neuron starts out on an equal footing, with no initial preference.
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    # Forward pass
    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both integer tensors of shape (B, T)
        # (B, T, C), token embeddings
        tok_emb = self.token_embedding_table(idx)
        # (T, C), position embeddings
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # (B, T, C), add the token and position embeddings so x carries both token identity and position information
        # Note: addition works better than concatenation here:
        # 1. Addition does not change the dimensionality
        # 2. Concatenation is really a special case of "matrix multiply then add", so multiply-then-add is the more flexible view
        x = tok_emb + pos_emb
        # (B, T, C), pass through the stacked Transformer blocks
        x = self.blocks(x)
        # (B, T, C), final layer normalization
        x = self.ln_f(x)
        # (B, T, vocab_size), linear layer producing the logits for the next token
        logits = self.lm_head(x)
        # Distinguish training from pure inference; pure inference does not need the loss
        if targets is None:
            loss = None
        else:
            # Get the three dimensions again: batch_size * block_size * vocab_size
            B, T, C = logits.shape
            # Flatten the first two dimensions into one
            logits = logits.view(B * T, C)
            # Flatten the targets as well
            targets = targets.view(B * T)
            # Cross-entropy measures the gap between the predicted distribution and the actual targets
            # (cross_entropy accepts exactly this combination of inputs: a distribution over classes and integer target ids)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
    # Generate next-token predictions
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        # Generate max_new_tokens tokens, one per loop iteration
        for _ in range(max_new_tokens):
            # Crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # Get the next-token predictions at every position: logits of shape batch_size * block_size * vocab_size
            logits, loss = self(idx_cond)
            # Throw everything else away and keep only the last position
            logits = logits[:, -1, :]  # becomes (B, C)
            # Apply softmax (maps to [0,1], summing to 1) to get the probabilities of each possible next token; dim=-1 means along the last dimension
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # For each row, sample an index according to those probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append the sampled token to the sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx
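A quick illustration of the tok_emb + pos_emb line in forward(): the (T, C) position embeddings broadcast over the batch dimension of the (B, T, C) token embeddings, so every sequence in the batch receives the same positional vectors. A minimal sketch with made-up shapes:
tok_emb_demo = torch.randn(4, 8, 16)  # (B, T, C)
pos_emb_demo = torch.randn(8, 16)     # (T, C)
print((tok_emb_demo + pos_emb_demo).shape)  # torch.Size([4, 8, 16])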
4. Training the Model¶
In [28]:
model = GPTLanguageModel()
m = model.to(device)
# Compute and print the total number of parameters
print(sum(p.numel() for p in m.parameters()) / 1e6, "M parameters")
11.015784 M parameters
In [29]:
# Create the optimizer; AdamW is used (it adapts the step size dynamically, which improves convergence)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
In [34]:
def train(train_times):
    for iter in range(train_times):
        # When it is time for a loss check, run a loss estimate so convergence can be monitored
        if iter % eval_interval == 0 or iter == train_times - 1:
            losses = estimate_loss()
            print(
                f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}"
            )
        # Grab a batch of data
        xb, yb = get_batch("train")
        # Compute the loss
        logits, loss = model(xb, yb)
        # Reset the gradients to zero; otherwise they accumulate. PyTorch accumulates gradients by default
        # so that they can be accumulated manually when needed (e.g. splitting a large batch into several passes)
        optimizer.zero_grad(set_to_none=True)
        # PyTorch's built-in backpropagation computes the gradient of every parameter
        loss.backward()
        # The optimizer updates each parameter based on its gradient w.r.t. the loss;
        # the actual step size is determined jointly by the gradient and the optimization algorithm (AdamW here)
        optimizer.step()
In [32]:
# First, try generating with the untrained model
# Start generation from 'I '
context = torch.zeros((1, 2), dtype=torch.long, device=device)
context[0][0] = encode("I", merges)[0]
context[0][1] = encode(" ", merges)[0]
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist(), vocab))
I ed ���owęed_reoral�owghearo-one ��uar�K�lowAyou H or)�it�it�hanothe )nHarrll*ly ����B,le riarere ��in T���, �R�ing d�lowP��k �ve s�/�I��er����Bere " Ї�one �ed ��aid �sve oHarrirw_'��al9ostIaive >�r�s sTe, re�os you acof ai�th was �a ione youv�the�he �<vc�the��;sslesing un( said �O�inIs to King`f��en ސ. HarrarrHen � uer �cRghgޑ�thlithe �oo�sa t1���l)noGdk?���Rer0]��Y�Harry 0s�k ghn�l�l��jhad rie%eous 7�~_vesicwas ) er a0�b��roar�in Harr�rok �bve riV's ��ea�the �<eapt thehe ter�you �oohaPle�y, ri9'hiuy, -gӛat�foweawa~ing#the <�=, �ss berdd t @rto e, `algh-e, ��rk ��0i���noo}ou#�o�C�We=of �?wa�. it ��Harry �C��ve �����mto �Harrto �����vchlicbing�ac�a �l�vathe he ��;" o Qed �ouenur���orne �o Only �ai��u���ri'�� oo�he �4.to skwas . �_ڄv�l����ou.ea8aid Rouea ��it it ]!�0acha�d�k �reood� �y��\���:cove }���-?u�k k to ��ll �L s�or AZbe��9��an:�. wiy, q�that ree bb��thehehis �er�4!�~��noac<�ip+to ing haHarr�ai�mi�ir� 1ly ~y, on ��#�er the (�one aid ��. �<�. of �in b�stbs where [�lehatHarro�DA�the �octhe A�$Is ar�d�mipalhe+sk of ;P'Z s��<oHarrt ron �&���ging he�mijle�. y, ��~. it '��ooehreur�-er ��icfo@haor)�onur��Wchicainoll�BHarry �XT0s old �]6� 巈�e m and ��anZ�acliar,!�o�ld �ing Us ou��4~y, �e�����arb�S�onno��:ed �FM�ing �Be6at inn�e 's �>stand "noere @arcH�,owwaom�Pm \�+mthat <}Oirare��ll"���=in�ed e the e itouarleMve ��B�b�t�ly �agt8er-�Wat lele q��wy, t *ri��aid �you ouI �itlat �?+D's donou�ve th�at �bs"�no�no'V��d
In [35]:
# Train for the first 1000 steps
train(1000)
step 0: train loss 5.3148, val loss 5.3151
step 200: train loss 3.3250, val loss 3.3436
step 400: train loss 2.6803, val loss 2.7322
step 600: train loss 2.3169, val loss 2.3941
step 800: train loss 2.1576, val loss 2.2524
step 999: train loss 2.0564, val loss 2.1648
In [36]:
# Generate to see how it looks
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist(), vocab))
I exuse, unafter a march to call of voice, and his teaching higher, ruding over the stairs, it dark ted between his head. 'CH EH Wood,' Ernies stonct to pointing and calling the school as he walking econdition. Sides from the dementors against air out of ears around them at him. If the page, the maze better Vastle, or Dalps of them was sister of archment could returning in himself into compart ase. Evoise miss there, search tack in, so Harry when sonal passed in the care of the corner, not all her they lower into intere, in underc"suaes a low and they fingeriuntom! Homauld forled 'eop a Serucous on them! When they'�mist?" and seized. "All riNis!" asked Ginny, rapping breathing wildarble. "Priving their seeth, too's tongue off how madered? A most unHagrid all around the light and recognar as Harry walkd-Ey, Lily Patronague–" "Hella!," he said. "There Pervan in the room. . . . . .” “Ere it is cleandable time I get, first pip you in a little foist,” she added, watrans. Hagrid's crimpasses, Harry looked at the coll and down on his got table. “Lupin tom! Well — ' To Brook Ron,” was shorrowing them a familiar “I naught you drinins!' know the deze, there was a card. “What – he’re sup the secontrat ront s in Briday, disapprudy!” She binky: Hagrid, seized Snape it’s impkonse throom in front racing the Daily an Harry, ParvenEmple. Harry sawed friender frog, and so at Hagrid’s appeared, Harry gadthugh interin to water. Justin and swallowed the treepers in his toald, and they all through Harry to kept you seize Ron and Ginny. “I’ve had got try to go to the wind reach position of despossi-Mayle whole right. “Ha!” “Beolboard,” he said quietl
Note: the quality of the generated text has improved markedly, and the loss does not look like it has converged yet, so there is room for more training.
In [37]:
# Train steps 1001-2000
train(1000)
step 0: train loss 2.0588, val loss 2.1646
step 200: train loss 1.9920, val loss 2.1198
step 400: train loss 1.9374, val loss 2.0776
step 600: train loss 1.8985, val loss 2.0346
step 800: train loss 1.8627, val loss 2.0159
step 999: train loss 1.8339, val loss 1.9861
In [38]:
# Generate again to see how it looks
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist(), vocab))
I up the evening,” “It doing,” said Harry. Mrs Diddle’s piece of cage to searcase, didn’t. It had deliqustimided netension of obvious, he feeled that he said it was sent towards the carryer, right and swall to make himself in Breath and a luncht disappearance of the torches. He heard a Gaunt forating not plain. It, it alone. “But for a bombaby. To know hold it.” “You left,” said Hermione, still made her way with him as on the passage, chetting the Daily Prophet had reached the madoor, Hutching marlie purple, and a spellbook, whog, and trief opened in ead. All of continued attrippers, tiny y dressed his headmistress? “. . .” Ron whispered the haunnot els. Harry p" Moody examined through the back of the tightmare being and talk to Hermione. Whether Harry had explainly set that coolly’s rumphed away front of it, beaming upon Mr. Cattered for a hamper sap her beak-robes to brightly fent. “Vernon merops, glin and Deanwhile you ought to incredular stress with the palner of Harry and gockase, whatt, and was thickbened that she got for excited to such thescene course, in a hushed voicementc! The Ministry without a with a blap houlder of the wand in SHarry and Dursleys, told them in this less, and that was Crubbunzarr, and they would say hear it habout him to find out an end him being unplaced in his navement many career back normal as hadopp. And I haven't seem to lad how him anything happened on Schools... Why was half. This one rained me, who went on, Mrs. Durrady the sink was most me. They stared for a moment, already look. They rare just before they was Runnows starting to be the traveloping vagulously burn into an across %; Hagrid's normalt left and into the bow om began to limble of coin, removed vi
Note: the quality does seem a bit higher again, and the val loss is still decreasing steadily, so train for another 1000 steps.
In [39]:
# Train steps 2001-3000
train(1000)
step 0: train loss 1.8371, val loss 1.9926
step 200: train loss 1.8057, val loss 1.9743
step 400: train loss 1.7830, val loss 1.9576
step 600: train loss 1.7592, val loss 1.9386
step 800: train loss 1.7442, val loss 1.9324
step 999: train loss 1.7273, val loss 1.9188
In [40]:
# Generate again to see how it looks
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist(), vocab))
I "He might think you're more," came Madame Aunt Hermione hard and remazing onhot outside. "He want’ve ved and come back there up, thisn't it? Why’re seeking from – over to be worthy? Harry – how he’s rat? His not asked a Horntail of twenty hidden new settles. He had none of thing that it had won’t go to." "I'd have told –" he said. Harry stared around at the last of the mountain newsping room and Luna Sorting to stop outside her. The dess was approached there, he shook his hand. "I aware like this!" said Hermione in igonal from the first red to Spect heirst singdule at her. "Er ?he supposed ter me." Harry tried about time to say it; he had seen another in shiny, silence behind him. Ron large in the bill's nose —hunfoy, gave him as he couple he pushed at once, thought she was still still disapping the copy of the Hogword to Ron. "It found how when they're we're Dumbledore of second knew that the followed downward continuers contribut after a spat was waken something that night. "Ata mounter-of-- a row tuneerly obey-two claim while asleepished. I hope these wanders –" "Have a run with the Dark of emerald — �ever," Morfin Harry students of Ron hastily suddents in their ownersides of navement mesely's four-matters. He head to a justice toward the lock- PAPeakes Pomfrey and (Malfoy and Hermione told him he'd better go upright, then prepecto a loak, and pleased its wandlying on her) were thugh he'd something he climbed the boggart green him. "No it's my Goster!" Hermione hispered. "It was say," said Hagrid. "Did they could go need started out with him to be pretturn - not there! I can't see us neither." They walked cauldron. 2 "If see
Note: there is still no big gap between the train loss and the val loss (not much overfitting yet), so train for another 1000 steps.
In [41]:
# Train steps 3001-4000
train(1000)
step 0: train loss 1.7256, val loss 1.9212
step 200: train loss 1.7087, val loss 1.9083
step 400: train loss 1.6950, val loss 1.8981
step 600: train loss 1.6771, val loss 1.8904
step 800: train loss 1.6688, val loss 1.8873
step 999: train loss 1.6540, val loss 1.8849
In [42]:
# Generate again to see how it looks
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist(), vocab))
I nextsemblood if it came into Umbridge turning upon him, he resisted a spectacular damplike broken motion that had been. "I see you," said Hermione, and the rain leapt from the lace. "Yours ago," said Harry, "I want another handwrit life. I met a wand wizardship that lie together, is early Harry' Man ran a load. Harry, young wife was a werewolst, and Winky sought, but because she was stating in us, fast and Davies is not wear of the post moves, Harry remembers of my escalm, but inmit himself and quiet not idea what Morfin might have felt beyone: the warts stood opening there is laughing and todaybe .... Our fixy Madam Rookwoods were mistaken. Be realized, just put a chamber in furious to move ago. . . .” “Shaa,” said Harry infliently. “Ronan sent you reminion, off,” she said rowly, snatching the pudiel of Ministers. “Says I’ve been what, finally ing, Anithur, Seven to Mafe mattergroitself … ” “Hagrid ifteen, you’re put out his wand up? Verish humm! It too worky a sense!” he shouted at Harry to clash angle of ink away, behind the backs a numbrelltion in your right black-lastic crack and deep close, … “ “I will find find out house this summer!” snore of white: "Please, I’ve been easy to Dad in the castle. I never four Brizar The Quibblered the Hengoess To Muggle, back draw, the most wishing he would got go and regreen better,” Harry and Oluded Ginny, was extravely resulted. The boy was facing the entrance to penaur apart from like Dumbledore— his own wand which should her school something: all of them joined a gnome, got away in main, as though happening to Hermione’s all mable, which h
Note: the train loss is still dropping by about 0.01 per 200 steps, but the val loss only drops by about 0.003, so there is some risk of overfitting; stop here. With enough additional training data, the training and validation performance should stay closer together, which would support further training.
5. Saving and Loading Model Parameters¶
In [43]:
# Save
torch.save(model.state_dict(), "model_params_harrypotter_tokenized.pth")
In [44]:
# Load the saved parameters
model = GPTLanguageModel()
m = model.to(device)
# Load the previously saved parameters
m.load_state_dict(
    torch.load("model_params_harrypotter_tokenized.pth", weights_only=True)
)
Out[44]:
<All keys matched successfully>
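Optional: since dropout is only disabled in evaluation mode (as noted in estimate_loss above), the freshly loaded model can be switched to eval mode before it is used purely for generation; a one-line sketch:
m.eval()  # disables dropout during generation; estimate_loss()/train() switch modes themselves as needed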
6. Generating Text with the GPT Model¶
In [54]:
encode("Who ", merges)
Out[54]:
[87, 104, 274]
In [53]:
# Try generating something a bit longer
# Start generation from 'Who '
context = torch.zeros((1, 3), dtype=torch.long, device=device)
context[0][0] = 87
context[0][1] = 104
context[0][2] = 274
print(decode(m.generate(context, max_new_tokens=3000)[0].tolist(), vocab))
Who was their own outside was sitting along of the final lics; Ron seemed to try into every classroom as Mrs. WcMago. "So she wasn't beeaving!" said Ron, looking up at Harry madly through which the longer bade-whiled around Harry to prejudish over; the veins gazed around them. There was a long loud cloud objectral silk jammes, and the Durmstrang sound squing. Beathing as though missing: Snape missed, seeing the privolunteering usly from the bowl whose clest chairs, exting his family and dished his wand, his remembered thestral wind) watched Malfoy and his nose in chech. "Minister is visive to a Cedric on my body world, you will be expelled!" he asked. "But what is this, Potter, when you are moment?" "Of course, then, wizard might gone," said Harry. He had realised with his snake. "When it is, you know how he's taking Harry." "How did you die," said Ron, explaining his thrill of her phange. "Didn't I need toilet --" But the sound open darted in a chicky. Neville didn't come and what he wanted to face Sirius as a Hogwart. Onever stood before implost, he said, "See how to war Double . . and he was cut down a visit. However, shappy father." The most, babove Harry suddenly again was the best of help between she was now there on tocal trunkje. "Professor," she said slowly. "Now very packs begin-- so. . . what's big under thes? Acked Pettigrew the kitchen, campstal scared to the neck and perhapse at Ron, who did him knowing less had today. "I --" "What's happened my poked past," said Harry. "He didn't mkiused to make Mrs. Crouch. Okay. * Win-wash, kindner almay wandered Augusting. "And you see," said Dumbledore keeping their desks like witches she was carefully. "As he -- and this toast what's going on all of three stringen Harry beneated up in our hour -- Dementors -- when we've finished my wonderful, he can carrie, next to see if he did . . ." "Look at me!" said Dumbledore sadly. "I and Dumbledore ... and when I know . . . he's foreveryone thing." "Like you think Voldemort was," said Harry, what I saw now you should look like breezing what he to do. He was still, I want to bed . . . "But he didn't know how he was half Sirius Black," he said. "A Ludo Bagman?" said Duway, stifactly distracted. "Busy bother Black, he says, I have many me... just like his answer -" There was him on the table; had Suppable for a term. "Shope at luck, quick," said Riddle softly. "They must have think they left those me..." But Harry, Riddles had given Harry at Harry, wouldn't dare a so quick high. Something happened again, they stepped over them and put them out of a few staircase, and Ron choked at away. Harry looked there famous at the top of his face. He would give him something, he looked in a patch of coof watch turns. "To it is hard, that Krum he has been to become allowed by he dawn oand out of the cupil's potion, but which probably Voldemort goes you come to find a while obsecurity?" asked Harry as they stepped down. "She had never seen the thing," said Hermione in a shrug. "Well, he saw himself said thought he didn't. Whose wants him that. Ron, exactly."' she had prowling Harry Potter's feeble into the front of Harry and it rolled over direct off 'estying, so ite he, but somethi. Harry looked back o. . . Exactly he followed down the stone steps there: Hevere, who gave a swifty reas. "Didn't work senior of hundreds, Wood?" said Harry, getting danger until a towel was almost relief. "I'm sure What about it, shut?' Harry said. "I just trust one out of this. 
Well, then, I've off to page our Minister for impossion in Mrs. Poor Snape. Tell her that Well, at the Misap is abrupturing the situations' ears. . . she got ridiculous yries all over the good." Hagrid started alive, also andering side crack-body. "Harry!" said, very shaking his hards toward the castle as onto Harry's face continued. "Who is why you dare my after results?" Harry told himself some interferent as to the Christmas rockieteenage and yelling, but he had he been telling Hedwig Charm on either side of the Chasers. Excrucial warning his wand and was squinting out the bucked bus. In decensibles were not the door behind the guard; the skrewtshoe Harry passed, who lived dreamy to AzkabanDungboard, so that Privet Drive had inhere, when said, "I am well going back." Harry Potter. The three of them stared at the Pensions. Harry felt his led swirly. A damper like this, he saw the caption. At Harry, Ron and Hermione were using the evening manags to the centaur in the paper to his right stride out ohead. When he amHagudered on the Hufflepuff taching the twindow open. She had just two out where he was. Students were passing therous tomorrow, he was putting down, not during they've us a though he wouldn't let Harry and Ron out of school dealy studying. And late their eyes and was in the Dursleys looking there, with suck in Harry's receision, some pastill daying a very pleacing down and corner in the dazzling mallest falling dass. "Mernal Expoint was there, Potter, Jame... That one of your lesson, and to Ron and annoyed anyway," said Harry. The Owlery, Herboloha Charlie in the teapost was lie, now flipping his eyes on the seat and onto the pitch bebin, his Unv
3. Remarks
With our own trained Tokenizer in place, the generated text looks noticeably better than before (compared with the earlier character-level encoding).
A few issues remain:
- Overfitting: the GPT training losses show that the fit to the training set keeps improving, while the validation loss more or less plateaus after roughly 2000 steps.
    - This is mainly because the model has about 10M parameters while the corpus is only about 2M, roughly 1M tokens after tokenization.
    - The remedy is to find more training text.
- The Tokenizer works, but it runs slowly.
    - Right now it can only scan and merge on a single CPU core. For genuinely long texts, the input needs to be split into chunks and processed in parallel (see the first sketch at the end of this post).
- Special tokens could be added for special purposes (see the second sketch at the end of this post).
    - For example, marking the end of a document or the end of an answer
    - During training this helps the model distinguish what comes before and after the boundary
    - During inference it can serve as a stopping criterion
- The Tokenizer itself still has plenty of room for optimization and extra tricks.
    - As the GPT-2 paper points out, with a somewhat larger vocabulary, high word frequencies can cause 'dog!', 'dog?' and 'dog.' to be treated as single subwords, which is undesirable. OpenAI avoids this manually by first splitting the text with a regular expression (see the third sketch at the end of this post).
    - LLaMA's tokenizer also applies various tricks, such as treating ' dog' (with a leading space) and 'dog' (without) as the same word.
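A rough sketch of the parallel-encoding idea above, reusing the encode() and merges defined earlier (the chunk size, worker count and helper name are arbitrary choices of mine). Naively cutting the text means merges can never span a chunk boundary, so the output may differ slightly from encoding the whole text at once; also, under Windows/Jupyter the worker function may need to live in a separate module for multiprocessing to pickle it.
from functools import partial
from multiprocessing import Pool

def parallel_encode(text, merges, num_workers=4, chunk_size=100_000):
    # Cut the text into chunks (a smarter split would cut at whitespace or newlines)
    chunks = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Encode the chunks on several CPU cores
    with Pool(num_workers) as pool:
        encoded_chunks = pool.map(partial(encode, merges=merges), chunks)
    # Concatenate the per-chunk token id lists
    return [tid for chunk_ids in encoded_chunks for tid in chunk_ids]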
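A minimal sketch of the special-token idea, extending the loaded vocab with an end-of-text marker (the name <|endoftext|> and this particular approach are assumptions of mine, not something the trained tokenizer already contains):
eot_id = len(vocab)               # next free token id (360 here)
vocab[eot_id] = b"<|endoftext|>"  # decode() will render it as this literal string
vocab_size = len(vocab)           # the GPT embedding table and lm_head must be sized with this new value
# encode() never emits eot_id by itself; append it manually at document/answer boundaries when
# preparing training data, and stop generation once eot_id is sampled at inference time.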
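And a sketch of the regex pre-splitting trick: the text is first cut into word/number/punctuation chunks using the split pattern published with GPT-2 (it needs the third-party regex module for the \p{...} classes), and BPE merges are then applied within each chunk only, so 'dog' and '!' can never fuse into one token. For consistency, the tokenizer would also have to be trained on text split the same way.
import regex  # pip install regex; supports the \p{L} / \p{N} character classes

GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

def encode_with_split(text, merges):
    ids = []
    for chunk in regex.findall(GPT2_SPLIT_PATTERN, text):
        ids.extend(encode(chunk, merges))  # merges never cross chunk boundaries
    return ids

print(regex.findall(GPT2_SPLIT_PATTERN, "dog! dog? dog."))
# ['dog', '!', ' dog', '?', ' dog', '.']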