1. Introduction
Based on Andrej Karpathy's "Let's Build GPT" tutorial video.
Following that course, I implemented a bigram model in a Jupyter Notebook (a very simple and weak model, but it helps with learning PyTorch and lays the groundwork for studying Transformers and GPT).
I added more detailed comments to the code, plus some extra test and inspection cells from my own learning process.
2. Jupyter Notebook
In [2]:
# Read the input text; this uses the Tiny Shakespeare corpus: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
text = ""
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()
In [3]:
# Check the number of characters
print(f"len char:{len(text)}")
len char:1115394
In [4]:
# Inspect the first 1000 characters
print(text[:1000])
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.
In [5]:
# Deduplicate and sort (by ASCII code) to get the character set; note that character 0 is the newline '\n'
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print(f"vocab_size:{vocab_size}")
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab_size:65
In [6]:
# Build simple character-to-integer and integer-to-character mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
# encode: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
# decode: the inverse, take a list of integers, output a string
decode = lambda l: "".join([itos[i] for i in l])
In [7]:
# Quick test of encode
encode("hello")
Out[7]:
[46, 43, 50, 50, 53]
In [8]:
# Quick test of decode
decode([46, 43, 50, 50, 53])
Out[8]:
'hello'
In [9]:
import torch
# Encode the entire text and turn it into a 1-D tensor that torch can work with
data = torch.tensor(encode(text), dtype=torch.long)
In [10]:
# Look at the first 200 elements of this tensor
data[:200]
Out[10]:
tensor([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49, 6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 37, 53, 59, 1, 39, 56, 43, 1, 39, 50, 50, 1, 56, 43, 57, 53, 50, 60, 43, 42, 1, 56, 39, 58, 46, 43, 56, 1, 58, 53, 1, 42, 47, 43, 1, 58, 46, 39, 52, 1, 58, 53, 1, 44, 39, 51, 47, 57, 46, 12, 0, 0, 13, 50, 50, 10, 0, 30, 43, 57, 53, 50, 60, 43, 42, 8, 1, 56, 43, 57, 53, 50, 60, 43, 42, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 18, 47, 56, 57, 58, 6, 1, 63, 53, 59])
In [11]:
# Split the data: 90% for training, 10% for validation
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
In [12]:
# Set a few hyperparameters
# How many sequences to process in parallel per step
batch_size = 16
# Length of each sequence; the bigram model ultimately only looks at the previous character, so block_size does not affect the model itself, only how much data each training step uses
block_size = 8
In [13]:
# Pick the compute device depending on whether CUDA is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device
Out[13]:
'cuda'
In [14]:
# Think about what the model does: predict the next character from the preceding sequence
# The inputs then look like this (a bigram only uses the last position; an n-gram would need more)
x = train_data[:block_size]
# The targets look like this (shifted one position to the right, so even the longest context has a target)
y = train_data[1:block_size + 1]
# Every position of every sequence can in principle serve as a training example
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"the input: {context} , the target: {target}")
the input: tensor([18]) , the target: 47
the input: tensor([18, 47]) , the target: 56
the input: tensor([18, 47, 56]) , the target: 57
the input: tensor([18, 47, 56, 57]) , the target: 58
the input: tensor([18, 47, 56, 57, 58]) , the target: 1
the input: tensor([18, 47, 56, 57, 58, 1]) , the target: 15
the input: tensor([18, 47, 56, 57, 58, 1, 15]) , the target: 47
the input: tensor([18, 47, 56, 57, 58, 1, 15, 47]) , the target: 58
In [15]:
# Following the idea above, a function that samples a random batch from the data, returning inputs x and targets y
def get_batch(split):
    # Choose the training or validation split
    data = train_data if split == 'train' else val_data
    # Sample batch_size random starting indices
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # Slice out batch_size sequences of length block_size starting at those indices
    x = torch.stack([data[i:i+block_size] for i in ix])
    # Shift one position to the right to get the targets y
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    # Keep the data on the CPU or GPU
    x, y = x.to(device), y.to(device)
    return x, y
In [16]:
# Quick test of get_batch
xb, yb = get_batch('train')
print(f'the input: {xb.shape}\n{xb}')
print(f'the target: {yb.shape}\n{yb}')
the input: torch.Size([16, 8])
tensor([[43, 56,  1, 43, 52, 53, 59, 45],
        [59, 45, 46, 58, 43, 56,  5, 42],
        [46, 58,  1, 56, 43, 55, 59, 47],
        [39, 52, 45, 43, 58, 46,  1, 53],
        [51, 47, 52, 45,  1, 61, 47, 58],
        [59, 57,  1, 46, 39, 58, 46,  1],
        [ 6,  0, 15, 53, 51, 51, 47, 58],
        [43, 42,  1, 49, 47, 52, 45,  1],
        [43,  1, 16, 59, 49, 43,  1, 53],
        [42,  1, 58, 53,  1, 42, 43, 39],
        [52, 42,  1, 60, 43, 56, 63,  1],
        [ 6,  1, 57, 47, 52, 41, 43,  1],
        [39, 50, 50,  1, 52, 53, 58,  1],
        [58, 47, 53, 52,  1, 58, 53,  1],
        [39, 45, 45, 43, 56,  1, 47, 52],
        [43, 47, 45, 52, 43, 42,  1, 50]], device='cuda:0')
the target: torch.Size([16, 8])
tensor([[56,  1, 43, 52, 53, 59, 45, 46],
        [45, 46, 58, 43, 56,  5, 42,  6],
        [58,  1, 56, 43, 55, 59, 47, 56],
        [52, 45, 43, 58, 46,  1, 53, 60],
        [47, 52, 45,  1, 61, 47, 58, 46],
        [57,  1, 46, 39, 58, 46,  1, 58],
        [ 0, 15, 53, 51, 51, 47, 58,  5],
        [42,  1, 49, 47, 52, 45,  1, 47],
        [ 1, 16, 59, 49, 43,  1, 53, 44],
        [ 1, 58, 53,  1, 42, 43, 39, 58],
        [42,  1, 60, 43, 56, 63,  1, 56],
        [ 1, 57, 47, 52, 41, 43,  1, 21],
        [50, 50,  1, 52, 53, 58,  1, 53],
        [47, 53, 52,  1, 58, 53,  1, 46],
        [45, 45, 43, 56,  1, 47, 52, 10],
        [47, 45, 52, 43, 42,  1, 50, 53]], device='cuda:0')
In [17]:
# xb,yb = get_batch('val')
# print(f'the input: {xb.shape}\n{xb}')
# print(f'the input: {yb.shape}\n{yb}')
Start building the model
Begin with the simplest bigram model, which predicts the next character based only on the previous character
In [19]:
import torch
import torch.nn as nn
from torch.nn import functional as F
In [20]:
# Random seed (not particularly important)
torch.manual_seed(1337)
Out[20]:
<torch._C.Generator at 0x250e4897a70>
In [21]:
# Build the bigram model with PyTorch
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # A bigram only needs a lookup table mapping each character to the logits of its possible next
        # characters (hence a vocab_size * vocab_size matrix).
        # During the forward pass the corresponding row vector is simply looked up and returned
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx is the input x, a batch_size * block_size tensor, matching what we built above
        # targets is the input y, also a batch_size * block_size tensor, matching what we built above
        # Every element is mapped by token_embedding_table to a vocab_size vector, so logits is a
        # 3-D tensor of shape batch_size * block_size * vocab_size
        logits = self.token_embedding_table(idx)

        if targets is None:
            loss = None
        else:
            # Read back the three dimensions: batch_size * block_size * vocab_size
            B, T, C = logits.shape
            # Flatten the first two dimensions into one
            logits = logits.view(B*T, C)
            # Flatten the targets as well
            targets = targets.view(B*T)
            # Cross-entropy measures the gap between the predicted distribution and the actual target
            # (F.cross_entropy accepts exactly this combination: per-class logits plus class-index targets)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # Use the model to generate the next characters from idx (being a bigram, only the last position is actually used)
        # idx is again a batch_size * block_size 2-D tensor representing the existing context
        # Generate max_new_tokens times in a loop
        for _ in range(max_new_tokens):
            # Get the predictions for every position; logits is again a batch_size * block_size * vocab_size tensor
            # Note: we only need the prediction at the last position of each sequence, so some compute is wasted
            logits, loss = self(idx)
            # Throw everything else away and keep only the last position of each sequence
            logits = logits[:, -1, :] # becomes batch_size * vocab_size
            # Apply softmax (squashes values into [0,1], summing to 1) to get the probability of each successor.
            # dim=-1 means apply along the last dimension
            probs = F.softmax(logits, dim=-1)
            # For each row, sample an index according to those probabilities
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # Append the sampled index to the sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
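The reshaping inside forward() is there because F.cross_entropy expects 2-D logits of shape (N, C) together with 1-D class-index targets of shape (N,). A minimal standalone sketch of just that step (my own addition with tiny made-up sizes, not a cell from the original lesson):

import torch
from torch.nn import functional as F

B, T, C = 2, 3, 65                      # made-up batch size, block size and vocab size
logits = torch.randn(B, T, C)           # stand-in for token_embedding_table(idx)
targets = torch.randint(0, C, (B, T))   # stand-in for the "next character" indices
loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
print(loss.item())                      # for random logits this lands in the 4-5 range, near ln(65) ≈ 4.17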
In [22]:
# When estimating the loss, how many batches to average the cross-entropy over. More batches give a more accurate estimate but cost more compute
eval_iters = 100

@torch.no_grad()
def estimate_loss():
    out = {}
    # Put the model into evaluation mode. This removes training-specific randomness so the estimate is stable and repeatable,
    # whereas in training mode some layers, such as Dropout and Batch Normalization, keep changing their behaviour/state
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            # Run one forward pass; calling model invokes the model's forward() method
            logits, loss = model(X, Y)
            # The loss returned by F.cross_entropy() is a tensor that also carries autograd information;
            # use loss.item() to extract the plain cross-entropy value
            losses[k] = loss.item()
        out[split] = losses.mean()
    # Switch back to training mode
    model.train()
    return out
In [23]:
# Create the model
model = BigramLanguageModel(vocab_size)
# Move the model to CPU or GPU
m = model.to(device)
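As a quick sanity check (my own addition), the model really does consist of nothing but the vocab_size * vocab_size lookup table, i.e. 65 * 65 = 4225 parameters:

# Count the parameters: a single 65 x 65 embedding table
print(sum(p.numel() for p in m.parameters()))   # 4225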
In [24]:
# Test the model with the xb, yb batch we sampled earlier
logits, loss = m(xb, yb)
# Inside forward(), logits was flattened to 2-D: (batch_size*block_size) * vocab_size
print(logits.shape)
# The loss value before any training
print(loss)
torch.Size([128, 65])
tensor(4.7133, device='cuda:0', grad_fn=<NllLossBackward0>)
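For reference (my own note, not part of the original output): a model that assigned equal probability to all 65 characters would score a cross-entropy of -ln(1/65) ≈ 4.17, so an untrained loss around 4.7 is in the expected ballpark; it is slightly higher because the randomly initialized embedding rows are not exactly uniform.

import math
# Cross-entropy of a uniform guess over the 65-character vocabulary
print(-math.log(1 / vocab_size))   # ≈ 4.174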
In [25]:
# Try generating text with the untrained model
# Build a 1*1 tensor holding the desired first character; here it is 21, which corresponds to 'I'
context = torch.zeros((1, 1), dtype=torch.long, device=device)
context[0][0] = 21
decode(context.tolist()[0])
Out[25]:
'I'
In [26]:
# Generate text starting from context
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist()))
I!qfzxfRkRZd wc'wfNfT;OLlTEeC K jxqPToTb?bXAUG:C-SGJO-33SM:C?YI3a hs:LVXJFhXeNuwqhObxZ.tSVrddXlaSZaNevjw3cHPyZWk,f'qZa-oizCjmuX YoR&$FMVTfXibIcB!!BA!$W:CdYlHxcbegRirYeYERnkciK;lxWvHFliqmoGSKtSV&BLqWk -.SGFW.byWjbO!UelIljnF$UV&v.C-hsE3SPyckzby:CUup;MpJssX3Qwty;vJlvBPUuIkyBf&pxY-ggCIgj$k:CGlIkJdlyltSPkqmNaW-wNAXQbjxCevib3sr'T:C-&dE$HZvETERSBfxJ$Fstp-LK3:CJ-xTrg wALkOdmnubruf?qA skz;3QQkhWTm:CEtxjep$vUMUE$EwffMfMPRrFdXKISKH.JrZKINLIk!a!,iyb&y&a SadapbWPT:VE!zLtYBTEivVKN.kqfa!a!eyCRrxltpmI&fy;VE?!3MJM?qE;:3SPkUAJG&ymrdHXy'WWWgR SPm o,SB;v$Ws$.-w'KoT;AUqq-w'PF.rdaJR?;w$-z;K:WhsBoin qHugUvxIERTXEqMc$zyfX:C&ysSF-t$Yw -.mJALEHao.?nktKp$vjKujxQLqevjPTAUNXeviv3vLKZ?dpx?!ULKoCPTsrIkp$viyYH.iCVPyHDOd&usCxEQ?eRjK$ALI:C-b$gGCCJM;scP!A?h$YUgn;RGSjUcUq,FXrxlgq-GJZvSPHbAaq-tO'XEHzc-ErW:ww3C C !x.vDCKumlxlF'n!uDxlNCllgCIv'PGrIy,Odc'PLdIFGZPAkNxIgiKu bHq$ &XnGev'QzXCDWtFymZ?YLIczooixMAXGoTtL!CnIIKvUe f3SKp$GRpDytGFo?PwMb?C?YWTottR:CJiw pEHBlTQlbkmZP!P,s&qMO FoT;a!b.iTXwatDU&LivY$WxZtTXrWL;Ju;qylxkz;gGo.e
Note: you can see that the untrained model's output is completely random, with no structure at all.
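The randomness in generate() comes from torch.multinomial, which samples an index in proportion to the softmax probabilities rather than always taking the most likely one. A tiny standalone sketch (my own addition, using a made-up 3-token vocabulary):

torch.manual_seed(0)
fake_logits = torch.tensor([[2.0, 0.5, -1.0]])   # made-up logits for a 3-token vocabulary
probs = F.softmax(fake_logits, dim=-1)           # roughly [0.79, 0.18, 0.04]
samples = torch.multinomial(probs, num_samples=10, replacement=True)
print(probs)
print(samples)                                   # mostly index 0, occasionally 1 or 2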
In [28]:
# Set the learning rate; since we use AdamW, this value can be chosen fairly loosely
learning_rate = 1e-2
# Number of training iterations
max_iters = 3000
# How often (in iterations) to estimate the loss
eval_interval = 100
# Create the optimizer, here AdamW
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # Periodically estimate the loss
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of training data
    xb, yb = get_batch('train')

    # Compute the loss
    logits, loss = model(xb, yb)
    # Reset the gradients to zero; otherwise they would keep accumulating, since PyTorch accumulates gradients
    # by default so that they can be summed manually in some situations (e.g. splitting a large batch into several passes)
    optimizer.zero_grad(set_to_none=True)
    # Backpropagate with torch's autograd to compute the gradients
    loss.backward()
    # The optimizer updates each parameter using its gradient w.r.t. the loss; the actual step size is
    # determined jointly by the gradient and the optimization algorithm, here AdamW
    optimizer.step()
step 0: train loss 4.7332, val loss 4.7260
step 100: train loss 3.8013, val loss 3.8048
step 200: train loss 3.2313, val loss 3.2300
step 300: train loss 2.9288, val loss 2.8880
step 400: train loss 2.7445, val loss 2.7525
step 500: train loss 2.6237, val loss 2.6487
step 600: train loss 2.5736, val loss 2.5993
step 700: train loss 2.5464, val loss 2.5698
step 800: train loss 2.5251, val loss 2.5398
step 900: train loss 2.5037, val loss 2.5394
step 1000: train loss 2.5012, val loss 2.5145
step 1100: train loss 2.4819, val loss 2.5378
step 1200: train loss 2.4937, val loss 2.5185
step 1300: train loss 2.4991, val loss 2.5217
step 1400: train loss 2.4858, val loss 2.5272
step 1500: train loss 2.4746, val loss 2.5129
step 1600: train loss 2.4969, val loss 2.5068
step 1700: train loss 2.4772, val loss 2.5061
step 1800: train loss 2.4870, val loss 2.4905
step 1900: train loss 2.4749, val loss 2.4891
step 2000: train loss 2.4652, val loss 2.4952
step 2100: train loss 2.4732, val loss 2.4863
step 2200: train loss 2.4598, val loss 2.4975
step 2300: train loss 2.4649, val loss 2.4869
step 2400: train loss 2.4526, val loss 2.5045
step 2500: train loss 2.4645, val loss 2.4988
step 2600: train loss 2.4602, val loss 2.4794
step 2700: train loss 2.4643, val loss 2.4817
step 2800: train loss 2.4545, val loss 2.5067
step 2900: train loss 2.4591, val loss 2.4964
In [29]:
# Now that the model has been trained, try generating with it again
# Build a 1*1 tensor holding the desired first character; here it is 21, which corresponds to 'I'
context = torch.zeros((1, 1), dtype=torch.long, device=device)
context[0][0] = 21
decode(context.tolist()[0])
Out[29]:
'I'
In [81]:
# Generate
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist()))
I gin cmy tofou winca e omedikinin atorin, un, Wh orir t, CI d ces nid n wethanole thourselle d!PZAy I fr be Jut maid f bl k hanon; 'ds A bes Dout f illemerer, BRY fano I dl mathepen f w--bukshe! theve at, minia! ce w garyome Goll, t m'do amyos, wises ne aves thepred; m grconend n he bshasmethityosifowha alllicr tes wothoulor athis held. INThallele, amalf merqus. MNowhinkid se o. T: TE att od OLove f cour howatltheay, y I'd bunth ast o ngy: QUTheno ghenurd DD t, waprcrrt kee oy flesserd n k's hy RYo e? TEDY: Y more oultime ARDWinthel gondoleraysind, myOnato t be Ant then merims mong rve COnd berm t welile MPOM: Y: itidyoumil llle be; yif TOnon n wefale gu, ber BORoreathorer to t u' oren te, ncoup, ghe ayous pyod w ird sce if ace, w g s IONoveryaulisou ANGOUS: ToUERMENLENII terut h f, I out hilles ssparet he: ANoce, rerselisecenk lll, cave bt! adsbound; n at sea n'd ttol, penuratref an t: Sh atotin, ten yor, Whegar olis, s RISCO: TI st tofo m t yesod -pry Maved nn gedd, ! II ar y
You can see that, although the output is still meaningless, it looks much more normal than before training. The main reason it fails to form meaningful words and sentences is that the bigram model predicts from the previous character alone, which is far too little information.
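Because the whole model is just that vocab_size * vocab_size lookup table, you can also inspect directly what it has learned. A minimal sketch (my own addition, reusing m, stoi and itos from above) that prints the five most likely successors of a chosen character:

# Each row of the embedding matrix holds the logits for "what follows this character"
row = m.token_embedding_table.weight[stoi['q']]   # logits for the characters that follow 'q'
probs = F.softmax(row, dim=-1)
top = torch.topk(probs, 5)
print([(itos[i.item()], round(p.item(), 3)) for p, i in zip(top.values, top.indices)])
# After training, 'u' should dominate this list, since 'q' is almost always followed by 'u'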