1. Introduction
Based on Andrej Karpathy's "Let's Build GPT" tutorial video.
Following that course, I implemented a bigram model in a Jupyter Notebook (a very simple and weak model, but it helps with learning PyTorch and lays the groundwork for studying Transformers and GPT).
I added more detailed comments to the code, plus some extra test and inspection cells from my own learning process.
2. Jupyter Notebook
In [2]:
# Read the input text; this uses the Tiny Shakespeare corpus: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
text = ""
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()
In [3]:
# Check the number of characters
print(f"len char:{len(text)}")
len char:1115394
In [4]:
# Inspect the first 1000 characters
print(text[:1000])
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.
In [5]:
# Deduplicate and sort (by ASCII code) to get the character set; note that character 0 is the newline '\n'
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print(f"vocab_size:{vocab_size}")
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab_size:65
In [6]:
# Build simple character-to-integer and integer-to-character mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
# encode: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
# decode: the inverse, take a list of integers, output a string
decode = lambda l: "".join([itos[i] for i in l])
In [7]:
# Quick test of encode
encode("hello")
Out[7]:
[46, 43, 50, 50, 53]
In [8]:
# Quick test of decode
decode([46, 43, 50, 50, 53])
Out[8]:
'hello'
In [9]:
import torch
# Encode the entire text and turn it into a 1-D tensor that torch can work with
data = torch.tensor(encode(text), dtype=torch.long)
In [10]:
# Look at the first 200 elements of this tensor
data[:200]
Out[10]:
tensor([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49, 6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 37, 53, 59, 1, 39, 56, 43, 1, 39, 50, 50, 1, 56, 43, 57, 53, 50, 60, 43, 42, 1, 56, 39, 58, 46, 43, 56, 1, 58, 53, 1, 42, 47, 43, 1, 58, 46, 39, 52, 1, 58, 53, 1, 44, 39, 51, 47, 57, 46, 12, 0, 0, 13, 50, 50, 10, 0, 30, 43, 57, 53, 50, 60, 43, 42, 8, 1, 56, 43, 57, 53, 50, 60, 43, 42, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 18, 47, 56, 57, 58, 6, 1, 63, 53, 59])
In [11]:
# Split the data: 90% for training, 10% for validation
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
In [12]:
# Set a few hyperparameters
# How many sequences to process in parallel per step
batch_size = 16
# Length of each sequence; the bigram model ultimately only looks at the previous character, so block_size does not affect the model itself, only how much data each training step uses
block_size = 8
In [13]:
# Pick the compute device depending on whether CUDA is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device
Out[13]:
'cuda'
In [14]:
# Think about what the model does: predict the next character from the preceding sequence
# The inputs then look like this (a bigram only uses the last position; an n-gram would need more)
x = train_data[:block_size]
# The targets look like this (shifted one position to the right, so even the longest context has a target)
y = train_data[1:block_size + 1]
# Every position of every sequence can in principle serve as a training example
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"the input: {context} , the target: {target}")
the input: tensor([18]) , the target: 47
the input: tensor([18, 47]) , the target: 56
the input: tensor([18, 47, 56]) , the target: 57
the input: tensor([18, 47, 56, 57]) , the target: 58
the input: tensor([18, 47, 56, 57, 58]) , the target: 1
the input: tensor([18, 47, 56, 57, 58, 1]) , the target: 15
the input: tensor([18, 47, 56, 57, 58, 1, 15]) , the target: 47
the input: tensor([18, 47, 56, 57, 58, 1, 15, 47]) , the target: 58
In [15]:
# Following the idea above, a function that samples a random batch from the data, returning inputs x and targets y
def get_batch(split):
    # Choose the training or validation split
    data = train_data if split == 'train' else val_data
    # Sample batch_size random starting indices
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # Slice out batch_size sequences of length block_size starting at those indices
    x = torch.stack([data[i:i+block_size] for i in ix])
    # Shift one position to the right to get the targets y
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    # Keep the data on the CPU or GPU
    x, y = x.to(device), y.to(device)
    return x, y
In [16]:
# Quick test of get_batch
xb, yb = get_batch('train')
print(f'the input: {xb.shape}\n{xb}')
print(f'the target: {yb.shape}\n{yb}')
the input: torch.Size([16, 8])
tensor([[43, 56,  1, 43, 52, 53, 59, 45],
        [59, 45, 46, 58, 43, 56,  5, 42],
        [46, 58,  1, 56, 43, 55, 59, 47],
        [39, 52, 45, 43, 58, 46,  1, 53],
        [51, 47, 52, 45,  1, 61, 47, 58],
        [59, 57,  1, 46, 39, 58, 46,  1],
        [ 6,  0, 15, 53, 51, 51, 47, 58],
        [43, 42,  1, 49, 47, 52, 45,  1],
        [43,  1, 16, 59, 49, 43,  1, 53],
        [42,  1, 58, 53,  1, 42, 43, 39],
        [52, 42,  1, 60, 43, 56, 63,  1],
        [ 6,  1, 57, 47, 52, 41, 43,  1],
        [39, 50, 50,  1, 52, 53, 58,  1],
        [58, 47, 53, 52,  1, 58, 53,  1],
        [39, 45, 45, 43, 56,  1, 47, 52],
        [43, 47, 45, 52, 43, 42,  1, 50]], device='cuda:0')
the target: torch.Size([16, 8])
tensor([[56,  1, 43, 52, 53, 59, 45, 46],
        [45, 46, 58, 43, 56,  5, 42,  6],
        [58,  1, 56, 43, 55, 59, 47, 56],
        [52, 45, 43, 58, 46,  1, 53, 60],
        [47, 52, 45,  1, 61, 47, 58, 46],
        [57,  1, 46, 39, 58, 46,  1, 58],
        [ 0, 15, 53, 51, 51, 47, 58,  5],
        [42,  1, 49, 47, 52, 45,  1, 47],
        [ 1, 16, 59, 49, 43,  1, 53, 44],
        [ 1, 58, 53,  1, 42, 43, 39, 58],
        [42,  1, 60, 43, 56, 63,  1, 56],
        [ 1, 57, 47, 52, 41, 43,  1, 21],
        [50, 50,  1, 52, 53, 58,  1, 53],
        [47, 53, 52,  1, 58, 53,  1, 46],
        [45, 45, 43, 56,  1, 47, 52, 10],
        [47, 45, 52, 43, 42,  1, 50, 53]], device='cuda:0')
In [17]:
# xb,yb = get_batch('val')
# print(f'the input: {xb.shape}\n{xb}')
# print(f'the input: {yb.shape}\n{yb}')
Start building the model
Begin with the simplest bigram model, which predicts the next character based only on the previous character
In [19]:
import torch
import torch.nn as nn
from torch.nn import functional as F
In [20]:
# Random seed (not particularly important)
torch.manual_seed(1337)
Out[20]:
<torch._C.Generator at 0x250e4897a70>
In [21]:
# Build the bigram model with PyTorch
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # A bigram only needs a lookup table mapping each character to the logits of its possible next
        # characters (hence a vocab_size * vocab_size matrix).
        # During the forward pass the corresponding row vector is simply looked up and returned
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx is the input x, a batch_size * block_size tensor, matching what we built above
        # targets is the input y, also a batch_size * block_size tensor, matching what we built above
        # Every element is mapped by token_embedding_table to a vocab_size vector, so logits is a
        # 3-D tensor of shape batch_size * block_size * vocab_size
        logits = self.token_embedding_table(idx)

        if targets is None:
            loss = None
        else:
            # Read back the three dimensions: batch_size * block_size * vocab_size
            B, T, C = logits.shape
            # Flatten the first two dimensions into one
            logits = logits.view(B*T, C)
            # Flatten the targets as well
            targets = targets.view(B*T)
            # Cross-entropy measures the gap between the predicted distribution and the actual target
            # (F.cross_entropy accepts exactly this combination: per-class logits plus class-index targets)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # Use the model to generate the next characters from idx (being a bigram, only the last position is actually used)
        # idx is again a batch_size * block_size 2-D tensor representing the existing context
        # Generate max_new_tokens times in a loop
        for _ in range(max_new_tokens):
            # Get the predictions for every position; logits is again a batch_size * block_size * vocab_size tensor
            # Note: we only need the prediction at the last position of each sequence, so some compute is wasted
            logits, loss = self(idx)
            # Throw everything else away and keep only the last position of each sequence
            logits = logits[:, -1, :] # becomes batch_size * vocab_size
            # Apply softmax (squashes values into [0,1], summing to 1) to get the probability of each successor.
            # dim=-1 means apply along the last dimension
            probs = F.softmax(logits, dim=-1)
            # For each row, sample an index according to those probabilities
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # Append the sampled index to the sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
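The reshaping inside forward() is there because F.cross_entropy expects 2-D logits of shape (N, C) together with 1-D class-index targets of shape (N,). A minimal standalone sketch of just that step (my own addition with tiny made-up sizes, not a cell from the original lesson):

import torch
from torch.nn import functional as F

B, T, C = 2, 3, 65                      # made-up batch size, block size and vocab size
logits = torch.randn(B, T, C)           # stand-in for token_embedding_table(idx)
targets = torch.randint(0, C, (B, T))   # stand-in for the "next character" indices
loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
print(loss.item())                      # for random logits this lands in the 4-5 range, near ln(65) ≈ 4.17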
In [22]:
# When estimating the loss, how many batches to average the cross-entropy over. More batches give a more accurate estimate but cost more compute
eval_iters = 100

@torch.no_grad()
def estimate_loss():
    out = {}
    # Put the model into evaluation mode. This removes training-specific randomness so the estimate is stable and repeatable,
    # whereas in training mode some layers, such as Dropout and Batch Normalization, keep changing their behaviour/state
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            # Run one forward pass; calling model invokes the model's forward() method
            logits, loss = model(X, Y)
            # The loss returned by F.cross_entropy() is a tensor that also carries autograd information;
            # use loss.item() to extract the plain cross-entropy value
            losses[k] = loss.item()
        out[split] = losses.mean()
    # Switch back to training mode
    model.train()
    return out
In [23]:
# Create the model
model = BigramLanguageModel(vocab_size)
# Move the model to CPU or GPU
m = model.to(device)
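As a quick sanity check (my own addition), the model really does consist of nothing but the vocab_size * vocab_size lookup table, i.e. 65 * 65 = 4225 parameters:

# Count the parameters: a single 65 x 65 embedding table
print(sum(p.numel() for p in m.parameters()))   # 4225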
In [24]:
# Test the model with the xb, yb batch we sampled earlier
logits, loss = m(xb, yb)
# Inside forward(), logits was flattened to 2-D: (batch_size*block_size) * vocab_size
print(logits.shape)
# The loss value before any training
print(loss)
torch.Size([128, 65])
tensor(4.7133, device='cuda:0', grad_fn=<NllLossBackward0>)
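For reference (my own note, not part of the original output): a model that assigned equal probability to all 65 characters would score a cross-entropy of -ln(1/65) ≈ 4.17, so an untrained loss around 4.7 is in the expected ballpark; it is slightly higher because the randomly initialized embedding rows are not exactly uniform.

import math
# Cross-entropy of a uniform guess over the 65-character vocabulary
print(-math.log(1 / vocab_size))   # ≈ 4.174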
In [25]:
# Try generating text with the untrained model
# Build a 1*1 tensor holding the desired first character; here it is 21, which corresponds to 'I'
context = torch.zeros((1, 1), dtype=torch.long, device=device)
context[0][0] = 21
decode(context.tolist()[0])
Out[25]:
'I'
In [26]:
# Generate text starting from context
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist()))
I!qfzxfRkRZd wc'wfNfT;OLlTEeC K jxqPToTb?bXAUG:C-SGJO-33SM:C?YI3a hs:LVXJFhXeNuwqhObxZ.tSVrddXlaSZaNevjw3cHPyZWk,f'qZa-oizCjmuX YoR&$FMVTfXibIcB!!BA!$W:CdYlHxcbegRirYeYERnkciK;lxWvHFliqmoGSKtSV&BLqWk -.SGFW.byWjbO!UelIljnF$UV&v.C-hsE3SPyckzby:CUup;MpJssX3Qwty;vJlvBPUuIkyBf&pxY-ggCIgj$k:CGlIkJdlyltSPkqmNaW-wNAXQbjxCevib3sr'T:C-&dE$HZvETERSBfxJ$Fstp-LK3:CJ-xTrg wALkOdmnubruf?qA skz;3QQkhWTm:CEtxjep$vUMUE$EwffMfMPRrFdXKISKH.JrZKINLIk!a!,iyb&y&a SadapbWPT:VE!zLtYBTEivVKN.kqfa!a!eyCRrxltpmI&fy;VE?!3MJM?qE;:3SPkUAJG&ymrdHXy'WWWgR SPm o,SB;v$Ws$.-w'KoT;AUqq-w'PF.rdaJR?;w$-z;K:WhsBoin qHugUvxIERTXEqMc$zyfX:C&ysSF-t$Yw -.mJALEHao.?nktKp$vjKujxQLqevjPTAUNXeviv3vLKZ?dpx?!ULKoCPTsrIkp$viyYH.iCVPyHDOd&usCxEQ?eRjK$ALI:C-b$gGCCJM;scP!A?h$YUgn;RGSjUcUq,FXrxlgq-GJZvSPHbAaq-tO'XEHzc-ErW:ww3C C !x.vDCKumlxlF'n!uDxlNCllgCIv'PGrIy,Odc'PLdIFGZPAkNxIgiKu bHq$ &XnGev'QzXCDWtFymZ?YLIczooixMAXGoTtL!CnIIKvUe f3SKp$GRpDytGFo?PwMb?C?YWTottR:CJiw pEHBlTQlbkmZP!P,s&qMO FoT;a!b.iTXwatDU&LivY$WxZtTXrWL;Ju;qylxkz;gGo.e
Note: you can see that the untrained model's output is completely random, with no structure at all.
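The randomness in generate() comes from torch.multinomial, which samples an index in proportion to the softmax probabilities rather than always taking the most likely one. A tiny standalone sketch (my own addition, using a made-up 3-token vocabulary):

torch.manual_seed(0)
fake_logits = torch.tensor([[2.0, 0.5, -1.0]])   # made-up logits for a 3-token vocabulary
probs = F.softmax(fake_logits, dim=-1)           # roughly [0.79, 0.18, 0.04]
samples = torch.multinomial(probs, num_samples=10, replacement=True)
print(probs)
print(samples)                                   # mostly index 0, occasionally 1 or 2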
In [28]:
# Set the learning rate; since we use AdamW, this value can be chosen fairly loosely
learning_rate = 1e-2
# Number of training iterations
max_iters = 3000
# How often (in iterations) to estimate the loss
eval_interval = 100
# Create the optimizer, here AdamW
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # Periodically estimate the loss
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of training data
    xb, yb = get_batch('train')

    # Compute the loss
    logits, loss = model(xb, yb)
    # Reset the gradients to zero; otherwise they would keep accumulating, since PyTorch accumulates gradients
    # by default so that they can be summed manually in some situations (e.g. splitting a large batch into several passes)
    optimizer.zero_grad(set_to_none=True)
    # Backpropagate with torch's autograd to compute the gradients
    loss.backward()
    # The optimizer updates each parameter using its gradient w.r.t. the loss; the actual step size is
    # determined jointly by the gradient and the optimization algorithm, here AdamW
    optimizer.step()
step 0: train loss 4.7332, val loss 4.7260
step 100: train loss 3.8013, val loss 3.8048
step 200: train loss 3.2313, val loss 3.2300
step 300: train loss 2.9288, val loss 2.8880
step 400: train loss 2.7445, val loss 2.7525
step 500: train loss 2.6237, val loss 2.6487
step 600: train loss 2.5736, val loss 2.5993
step 700: train loss 2.5464, val loss 2.5698
step 800: train loss 2.5251, val loss 2.5398
step 900: train loss 2.5037, val loss 2.5394
step 1000: train loss 2.5012, val loss 2.5145
step 1100: train loss 2.4819, val loss 2.5378
step 1200: train loss 2.4937, val loss 2.5185
step 1300: train loss 2.4991, val loss 2.5217
step 1400: train loss 2.4858, val loss 2.5272
step 1500: train loss 2.4746, val loss 2.5129
step 1600: train loss 2.4969, val loss 2.5068
step 1700: train loss 2.4772, val loss 2.5061
step 1800: train loss 2.4870, val loss 2.4905
step 1900: train loss 2.4749, val loss 2.4891
step 2000: train loss 2.4652, val loss 2.4952
step 2100: train loss 2.4732, val loss 2.4863
step 2200: train loss 2.4598, val loss 2.4975
step 2300: train loss 2.4649, val loss 2.4869
step 2400: train loss 2.4526, val loss 2.5045
step 2500: train loss 2.4645, val loss 2.4988
step 2600: train loss 2.4602, val loss 2.4794
step 2700: train loss 2.4643, val loss 2.4817
step 2800: train loss 2.4545, val loss 2.5067
step 2900: train loss 2.4591, val loss 2.4964
In [29]:
# Now that the model has been trained, try generating with it again
# Build a 1*1 tensor holding the desired first character; here it is 21, which corresponds to 'I'
context = torch.zeros((1, 1), dtype=torch.long, device=device)
context[0][0] = 21
decode(context.tolist()[0])
Out[29]:
'I'
In [81]:
# Generate
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist()))
I gin cmy tofou winca e omedikinin atorin, un, Wh orir t, CI d ces nid n wethanole thourselle d!PZAy I fr be Jut maid f bl k hanon; 'ds A bes Dout f illemerer, BRY fano I dl mathepen f w--bukshe! theve at, minia! ce w garyome Goll, t m'do amyos, wises ne aves thepred; m grconend n he bshasmethityosifowha alllicr tes wothoulor athis held. INThallele, amalf merqus. MNowhinkid se o. T: TE att od OLove f cour howatltheay, y I'd bunth ast o ngy: QUTheno ghenurd DD t, waprcrrt kee oy flesserd n k's hy RYo e? TEDY: Y more oultime ARDWinthel gondoleraysind, myOnato t be Ant then merims mong rve COnd berm t welile MPOM: Y: itidyoumil llle be; yif TOnon n wefale gu, ber BORoreathorer to t u' oren te, ncoup, ghe ayous pyod w ird sce if ace, w g s IONoveryaulisou ANGOUS: ToUERMENLENII terut h f, I out hilles ssparet he: ANoce, rerselisecenk lll, cave bt! adsbound; n at sea n'd ttol, penuratref an t: Sh atotin, ten yor, Whegar olis, s RISCO: TI st tofo m t yesod -pry Maved nn gedd, ! II ar y
You can see that, although the output is still meaningless, it looks much more normal than before training. The main reason it fails to form meaningful words and sentences is that the bigram model predicts from the previous character alone, which is far too little information.
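Because the whole model is just that vocab_size * vocab_size lookup table, you can also inspect directly what it has learned. A minimal sketch (my own addition, reusing m, stoi and itos from above) that prints the five most likely successors of a chosen character:

# Each row of the embedding matrix holds the logits for "what follows this character"
row = m.token_embedding_table.weight[stoi['q']]   # logits for the characters that follow 'q'
probs = F.softmax(row, dim=-1)
top = torch.topk(probs, 5)
print([(itos[i.item()], round(p.item(), 3)) for p, i in zip(top.values, top.indices)])
# After training, 'u' should dominate this list, since 'q' is almost always followed by 'u'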