BERT Word Embeddings Tutorial

AI周红伟

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/starzhou/article/details/113823211

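This article walks through loading a pretrained BERT model, the special tokens its input requires, tokenization, and the vocabulary, and shows how to extract word and sentence vectors from the hidden states. Using a word with multiple meanings as an example, it covers different strategies for building word and sentence vectors and confirms that BERT's vectors are context dependent.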

This article is translated from the BERT Word Embeddings Tutorial, with some of the content condensed. Please credit the source when reposting.

1. Loading Pre-Trained BERT

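Install the PyTorch interface for BERT from Hugging Face. The library also includes interfaces for other pretrained language models, such as OpenAI's GPT and GPT-2.

If you are running this code on Google Colab, you have to install this library each time you reconnect.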

!pip install transformers

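BERT is a pretrained model released by Google, trained on Wikipedia and Book Corpus (a dataset of 10,000+ books of different genres). Google released several variants of BERT; here we use the smaller of the two available sizes ("base" and "large"), in the uncased version that ignores letter case.

transformers provides a number of BERT classes for different tasks. Here we use the basic BertModel, whose output is not tailored to any specific task, which makes it a good choice for extracting embeddings.

Now let's import PyTorch, the pretrained BERT model, and a BERT tokenizer.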

import torch
from transformers import BertTokenizer, BertModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
# logging.basicConfig(level=logging.INFO)
import matplotlib.pyplot as plt
%matplotlib inline
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

2. Input Formatting

Because BERT is a pretrained model that expects input data in a specific format, we need:

A special token, [SEP], to mark the end of a sentence, or the separation between two sentences
A special token, [CLS], at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.
Tokens that conform with the fixed vocabulary used in BERT
The Token IDs for the tokens, from BERT’s tokenizer
Mask IDs to indicate which elements in the sequence are tokens and which are padding elements
Segment IDs used to distinguish different sentences
Positional Embeddings used to show token position within the sequence

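Fortunately, the tokenizer.encode_plus function takes care of all of this for us. But since this is an introduction to working with BERT, we will carry out these steps mostly by hand.

For a usage example of tokenizer.encode_plus, see this article.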
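To make the "mask IDs" item in the list above more concrete, here is a minimal pure-Python sketch of padding two sequences of token IDs to a common length and building the matching masks. The helper name pad_batch is hypothetical and not part of transformers; the token IDs are the ones printed in the vocabulary lookup later in this tutorial, and a pad ID of 0 is assumed to match the [PAD] entry of bert-base-uncased.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad lists of token IDs to equal length and build attention masks.

    Hypothetical helper for illustration; real code would normally let
    the tokenizer handle padding.
    """
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding element.
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# [CLS]=101, the=1996, river=2314, bank=2924, [SEP]=102
batch = [[101, 1996, 2314, 2924, 102], [101, 2924, 102]]
padded, masks = pad_batch(batch)
print(padded)  # [[101, 1996, 2314, 2924, 102], [101, 2924, 102, 0, 0]]
print(masks)   # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The example sentence used in the rest of this tutorial is a single unpadded sequence, so its mask would be all 1s.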

2.1 Special Tokens

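BERT can take one or two sentences as input. With two sentences, [SEP] separates them and the [CLS] token always appears at the start of the text; with a single sentence, both tokens are still required, and [SEP] then marks the end of the sentence. For example: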

Two-sentence input:

[CLS] The man went to the store. [SEP] He bought a gallon of milk. [SEP]

Single-sentence input:

[CLS] The man went to the store. [SEP]
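The two input formats can be produced by a small helper. This is only a sketch: the function name add_special_tokens is made up for illustration, and it assumes the usual BERT pair layout in which every sentence, including the second one, is closed by [SEP].

```python
def add_special_tokens(sentence_a, sentence_b=None):
    """Wrap one or two sentences in BERT's [CLS] / [SEP] markers.

    Hypothetical helper; it works on raw strings, before tokenization.
    """
    marked = "[CLS] " + sentence_a + " [SEP]"
    if sentence_b is not None:
        # The second sentence is also terminated by [SEP].
        marked += " " + sentence_b + " [SEP]"
    return marked


single = add_special_tokens("The man went to the store.")
pair = add_special_tokens("The man went to the store.",
                          "He bought a gallon of milk.")
print(single)  # [CLS] The man went to the store. [SEP]
print(pair)    # [CLS] The man went to the store. [SEP] He bought a gallon of milk. [SEP]
```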

2.2 Tokenization

BERT's tokenizer provides a tokenize method. Let's see how it handles a sentence.

text = "Here is the sentence I want embeddings for."
marked_text = "[CLS] " + text + " [SEP]"
# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)
# Print out the tokens.
print (tokenized_text)

# ['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']

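Notice how the word "embeddings" is represented: ['em', '##bed', '##ding', '##s']

The original word has been split into smaller subwords and characters. The two hash signs (##) in front of some of these subwords indicate that the subword or character is part of a larger word. So, for example, '##bed' and 'bed' are different tokens: the first is used when the subword "bed" occurs inside a larger word, the second is the standalone token.

Why does it look this way? Because BERT's tokenizer was created with a WordPiece model. This model greedily builds a fixed-size vocabulary of the characters, subwords, and words that best fit the language data. Since the tokenizer of our BERT model limits the vocabulary to about 30,000 entries, the vocabulary generated by WordPiece contains all English characters plus roughly 30,000 of the most common words and subwords found in the English corpus the model was trained on. The vocabulary contains four kinds of entries:

Whole words
Subwords that occur at the start of a word or on their own (the "em" in "embeddings" gets the same vector as the "em" in "go get em")
Subwords that are not at the start of a word, which are prefixed with "##"
Individual characters

Specifically, the tokenizer first checks whether the whole word is in the vocabulary. If not, it tries to break the word into the largest subwords that are in the vocabulary, and as a last resort it decomposes the word into individual characters. So a word can always, at the very least, be represented as a collection of its characters. As a result, out-of-vocabulary words are not assigned to a catch-all token like "UNK" but are decomposed into subword and character tokens.

So even though the word "embeddings" is not in the vocabulary, we do not treat it as an unknown word. Instead it is split into the subword tokens ['em', '##bed', '##ding', '##s'], which retain some of the contextual meaning of the original word. We can even average the embedding vectors of these subwords to produce an approximate vector for the original word. For more information about WordPiece, see the original paper.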
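The lookup procedure described above (whole word first, then the largest matching subwords) can be sketched as a greedy longest-match-first loop. This is a simplified illustration, not the library's implementation: the function name and the toy vocabulary are invented for the example, and returning [UNK] when no piece matches at all is an assumption about how such a fallback would look.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word (simplified sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Not even a single character matched (assumed fallback).
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"here", "bed", "em", "##bed", "##ding", "##s"}

print(wordpiece_tokenize("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize("bed", toy_vocab))         # ['bed']
```

With this toy vocabulary, "bed" on its own maps to the standalone token, while the same letters inside "embeddings" map to '##bed', which is the distinction described above.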

Here are some examples of the tokens in our vocabulary:

list(tokenizer.vocab.keys())[5000:5020]

['knight',
'lap',
'survey',
'ma',
'##ow',
'noise',
'billy',
'##ium',
'shooting',
'guide',
'bedroom',
'priest',
'resistance',
'motor',
'homes',
'sounded',
'giant',
'##mer',
'150',
'scenes']

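After breaking the text into tokens, we have to convert the sentence into a list of vocabulary indices. From here on we use the example sentence below, which uses the word "bank" in different senses.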

# Define a new example sentence with multiple meanings of the word "bank"
text = "After stealing money from the bank vault, the bank robber was seen " \
"fishing on the Mississippi river bank."
# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"
# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)
# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Display the words with their indices.
for tup in zip(tokenized_text, indexed_tokens):
    print('{:<12} {:>6,}'.format(tup[0], tup[1]))

[CLS] 101
after 2,044
stealing 11,065
money 2,769
from 2,013
the 1,996
bank 2,924
vault 11,632
, 1,010
the 1,996
bank 2,924
robber 27,307
was 2,001
seen 2,464
fishing 5,645
on 2,006
the 1,996
mississippi 5,900
river 2,314
bank 2,924
. 1,012
[SEP] 102
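Conceptually, mapping tokens to indices is a dictionary lookup. The sketch below mimics that with a tiny hand-written vocabulary containing a few of the IDs printed above; the function name is hypothetical, and the [UNK] ID of 100 is assumed to match bert-base-uncased.

```python
# A handful of entries copied from the output above, plus [UNK] (assumed ID 100).
toy_vocab = {
    "[UNK]": 100,
    "[CLS]": 101,
    "[SEP]": 102,
    "the": 1996,
    "river": 2314,
    "bank": 2924,
}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token, falling back to the unknown-token ID."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]


ids = tokens_to_ids(["[CLS]", "the", "river", "bank", "[SEP]"], toy_vocab)
print(ids)  # [101, 1996, 2314, 2924, 102]
```

All three occurrences of "bank" in the example sentence receive the same index, 2,924; any difference between their meanings only shows up later, in the vectors the model produces.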

2.3 Segment ID

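BERT expects 0s and 1s to distinguish two sentences. That is, for each token in tokenized_text we must specify which sentence it belongs to. For a single sentence, we simply pass a sequence of 1s; for two sentences, assign 0 to every token of the first sentence (including its [SEP]) and 1 to every token of the second.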

# Mark each of the 22 tokens as belonging to sentence "1".
segments_ids = [1] * len(tokenized_text)
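For the two-sentence case, the segment IDs can be derived from the position of the first [SEP]. A minimal sketch, with a made-up function name and a hand-tokenized example:

```python
def make_segment_ids(tokens):
    """Assign 0 up to and including the first [SEP], and 1 afterwards."""
    segment_ids = []
    current = 0
    for tok in tokens:
        segment_ids.append(current)
        if tok == "[SEP]":
            # Everything after the first [SEP] belongs to the second sentence.
            current = 1
    return segment_ids


pair_tokens = ["[CLS]", "the", "man", "went", "[SEP]",
               "he", "bought", "milk", "[SEP]"]
print(make_segment_ids(pair_tokens))  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```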

3. Extracting Embeddings

3.1 Running BERT on our text

Next, we need to convert our data to PyTorch tensors.

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

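Calling from_pretrained fetches the model from the internet. When we load bert-base-uncased, the model definition is printed in the log. The model is a deep neural network with 12 layers. Explaining what each layer does is beyond the scope of this article; you can find related material in earlier posts on my blog.

model.eval() puts the model in evaluation mode rather than training mode. In evaluation mode, dropout regularization is turned off.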

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

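Next, let's run the example text through the model and fetch the hidden states of the network.

torch.no_grad() tells PyTorch not to build the computation graph during the forward pass (since we will not run backpropagation here), which reduces memory consumption and speeds things up.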

# Run the text through BERT, and collect all of the hidden states produced
# from all 12 layers.
with torch.no_grad():
    # Pass the segment IDs by keyword: the second positional argument of
    # BertModel.forward is `attention_mask`, not `token_type_ids`.
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # Evaluating the model will return a different number of objects based on
    # how it's configured in the `from_pretrained` call earlier. In this case,
    # because we set `output_hidden_states = True`, the third item will be the
    # hidden states from all layers. See the documentation for more details:
    # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
    hidden_states = outputs[2]

3.2 Understanding the Output

The information stored in hidden_states is a little complicated. It has four dimensions, in this order:

The layer number (13 layers)
The batch number (1 sentence)
The word / token number (22 tokens in our sentence)
The hidden unit / feature number (768 features)

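Wait, 13 layers? Didn't we say BERT has only 12? The first entry is the output of the word embedding layer, and the remaining 12 are the outputs of the 12 encoder layers.

The second dimension (the batch size) is used when submitting multiple sentences to the model at once; here we have just one sentence.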

print ("Number of layers:", len(hidden_states), " (initial embeddings + 12 BERT layers)")
layer_i = 0
print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0
print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0
print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))

Number of layers: 13 (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 22
Number of hidden units: 768

A quick look at the range of values for a given token and layer shows that most of them fall within [-2, 2], with a few around -12.

# For the 5th token in our sentence, select its feature values from layer 5.
token_i = 5
layer_i = 5
vec = hidden_states[layer_i][batch_i][token_i]
# Plot the values as a histogram to show their distribution.
plt.figure(figsize=(10,10))
plt.hist(vec, bins=200)
plt.show()

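Grouping the values by layer makes sense for the model, but for our purposes we want them grouped by token.

Current dimensions: [layers, batches, tokens, features]

Desired dimensions: [tokens, layers, features]

Fortunately, PyTorch's permute function makes it easy to rearrange the dimensions. However, hidden_states is currently a Python sequence of per-layer tensors rather than a single tensor, so we first have to combine the layers into one tensor.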

# Concatenate the tensors for all layers. We use `stack` here to
# create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)
token_embeddings.size()
# torch.Size([13, 1, 22, 768])

Next we drop the "batch" dimension, since we do not need it.

# Remove dimension 1, the "batches".
token_embeddings = token_embeddings.squeeze(dim=1)
token_embeddings.size()
# torch.Size([13, 22, 768])

Finally, we use permute to swap the layer and token dimensions.

# Swap dimensions 0 and 1.
token_embeddings = token_embeddings.permute(1,0,2)
token_embeddings.size()
# torch.Size([22, 13, 768])
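As a sanity check on the reshaping, the sketch below repeats the stack, squeeze, and permute steps on random tensors of the same shapes and confirms that indexing [token, layer] in the result returns the same vector as indexing [layer][batch][token] in the input. The random tensors are stand-ins for the real model output, so the check runs without downloading BERT.

```python
import torch

torch.manual_seed(0)

# Stand-in for the model output: 13 per-layer tensors of shape [1, 22, 768].
fake_hidden_states = tuple(torch.randn(1, 22, 768) for _ in range(13))

# Same three steps as above, chained together.
fake_token_embeddings = (
    torch.stack(fake_hidden_states, dim=0)  # [13, 1, 22, 768]
    .squeeze(dim=1)                         # [13, 22, 768]
    .permute(1, 0, 2)                       # [22, 13, 768]
)
print(fake_token_embeddings.shape)  # torch.Size([22, 13, 768])

# The vector for token 7 at layer 5 is unchanged; only the index order differs.
layer_i, token_i = 5, 7
same = torch.equal(fake_token_embeddings[token_i, layer_i],
                   fake_hidden_states[layer_i][0][token_i])
print(same)  # True
```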

3.3 Creating word and sentence vectors from hidden states

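We would like a single vector for each word, or a single vector for the whole sentence. But for each input token we have 13 separate vectors, each of length 768. To get a single vector, we need to combine some of the layer vectors. But which layer, or which combination of layers, works best?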

Word Vectors

We create word vectors in two ways. The first is to concatenate the last four layers, which gives each token a vector of length 4*768 = 3072.

# Stores the token vectors, with shape [22 x 3,072]
token_vecs_cat = []
# `token_embeddings` is a [22 x 13 x 768] tensor.
# For each token in the sentence...
for token in token_embeddings:
    # `token` is a [13 x 768] tensor
    # Concatenate the vectors (that is, append them together) from
    # the last four layers.
    # Each layer vector is 768 values, so `cat_vec` is length 3072.
    cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
    # Use `cat_vec` to represent `token`.
    token_vecs_cat.append(cat_vec)
print ('Shape is: %d x %d' % (len(token_vecs_cat), len(token_vecs_cat[0])))
# Shape is: 22 x 3072

The second way is to sum the last four layers.

# Stores the token vectors, with shape [22 x 768]
token_vecs_sum = []
# `token_embeddings` is a [22 x 13 x 768] tensor.
# For each token in the sentence...
for token in token_embeddings:
    # `token` is a [13 x 768] tensor
    # Sum the vectors from the last four layers.
    sum_vec = torch.sum(token[-4:], dim=0)
    # Use `sum_vec` to represent `token`.
    token_vecs_sum.append(sum_vec)
print ('Shape is: %d x %d' % (len(token_vecs_sum), len(token_vecs_sum[0])))
# Shape is: 22 x 768
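Both strategies can also be written without a Python loop. The sketch below is one possible vectorized form, run on a random stand-in tensor of shape [22, 13, 768] rather than real BERT output; the variable names are chosen for this example. The flip reverses the layer order so that the concatenation follows the same order as the loop version (last layer first).

```python
import torch

torch.manual_seed(0)

# Random stand-in for `token_embeddings`: [tokens, layers, features].
fake_token_embeddings = torch.randn(22, 13, 768)

# Slice out the last four layers for every token: [22, 4, 768].
last_four = fake_token_embeddings[:, -4:, :]

# Concatenation: reverse the layer order to (-1, -2, -3, -4), then
# flatten layers and features into one axis, giving [22, 3072].
vecs_cat = last_four.flip(dims=[1]).reshape(22, -1)

# Summation over the layer axis, giving [22, 768].
vecs_sum = last_four.sum(dim=1)

print(vecs_cat.shape)  # torch.Size([22, 3072])
print(vecs_sum.shape)  # torch.Size([22, 768])

# Compare against the loop formulation for one token.
tok = fake_token_embeddings[3]
loop_cat = torch.cat((tok[-1], tok[-2], tok[-3], tok[-4]), dim=0)
loop_sum = torch.sum(tok[-4:], dim=0)
cat_matches = torch.equal(vecs_cat[3], loop_cat)
sum_matches = torch.allclose(vecs_sum[3], loop_sum)
```

The loop versions return Python lists of tensors, while this form returns a single tensor per strategy, which is usually more convenient for further computation.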

Sentence Vectors

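There are many strategies for obtaining a single vector for a whole sentence. One simple approach is to average the vectors of all tokens from the second-to-last layer.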

# `hidden_states` has shape [13 x 1 x 22 x 768]
# `token_vecs` is a tensor with shape [22 x 768]
token_vecs = hidden_states[-2][0]
# Calculate the average of all 22 token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)
print("Our final sentence embedding vector of shape:", sentence_embedding.size())
# Our final sentence embedding vector of shape: torch.Size([768])
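The average above covers all 22 tokens of a single, unpadded sentence. If several sentences were batched together with padding, the padding positions would presumably need to be left out of the average. One common way to do that is a mask-weighted mean, sketched below on random stand-in tensors; the shapes and variable names are assumptions made for this example and are not part of the original tutorial.

```python
import torch

torch.manual_seed(0)

# Stand-in for one layer's output for a batch: [batch, tokens, features].
layer_output = torch.randn(2, 6, 768)
# 1 = real token, 0 = padding. The second sentence has 4 real tokens.
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 0, 0]])

# Broadcast the mask over the feature axis: [2, 6, 1].
mask = attention_mask.unsqueeze(-1).to(layer_output.dtype)

# Sum only the real tokens, then divide by how many there are.
summed = (layer_output * mask).sum(dim=1)   # [2, 768]
counts = mask.sum(dim=1).clamp(min=1.0)     # [2, 1]
sentence_embeddings = summed / counts       # [2, 768]

print(sentence_embeddings.shape)  # torch.Size([2, 768])

# For the padded sentence this equals the plain mean over its 4 real tokens.
expected = layer_output[1, :4].mean(dim=0)
matches = torch.allclose(sentence_embeddings[1], expected, atol=1e-6)
```

For an unpadded sentence the mask is all 1s and this reduces to the plain torch.mean used above.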