【Hugging Face】Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. s

Originally published 2025-03-29 19:02:34 · 540 views


License: CC 4.0 BY-SA

Tags:

#encode_plus #encode #tokenizer #Hugging Face


The warning message is:

Be aware, overflowing tokens are not returned for the setting you have chosen, 
i.e. sequence pairs with the 'longest_first' truncation strategy. 
So the returned list will always be empty even if some tokens have been removed.

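- Analysis

This message typically appears when tokenizing with a Hugging Face tokenizer. Specifically, it arises when:

You truncate a sequence pair that exceeds the maximum length using the 'longest_first' truncation strategy.
You also enable the return_overflowing_tokens option in the tokenizer's encode or encode_plus call, but this truncation strategy does not support returning the overflowing tokens for sequence pairs.
Under longest_first, tokens are removed one at a time from whichever sequence is currently longer, so the removed tokens can come from both sequences. They are discarded rather than collected, which is why overflowing_tokens is empty.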
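
To make the longest_first behaviour concrete, here is a minimal pure-Python sketch. The function name truncate_longest_first is hypothetical and the logic is only loosely modelled on the strategy described above; it is not the library's implementation. It assumes tokens are dropped from the end of a sequence and that a tie removes from the second sequence.

```python
def truncate_longest_first(ids_a, ids_b, num_to_remove):
    """Illustrative only: drop tokens one at a time from the longer list.

    Returns the kept tokens of both sequences plus what was removed from each,
    so the interleaved nature of the removal is visible.
    """
    ids_a, ids_b = list(ids_a), list(ids_b)
    removed_a, removed_b = [], []
    for _ in range(num_to_remove):
        if not ids_a and not ids_b:
            break  # nothing left to remove
        if len(ids_a) > len(ids_b):
            removed_a.append(ids_a.pop())
        else:
            # Assumption: on a tie, the second sequence loses the token.
            removed_b.append(ids_b.pop())
    # Tokens were popped from the end, so reverse to restore reading order.
    return ids_a, ids_b, removed_a[::-1], removed_b[::-1]


kept_a, kept_b, removed_a, removed_b = truncate_longest_first(
    list(range(10)), list(range(100, 106)), num_to_remove=6
)
print(kept_a, kept_b, removed_a, removed_b)
```

In this run the removed tokens come from both inputs, so there is no single contiguous overflow to hand back, which matches the wording of the warning.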

- A typical code example that triggers the message

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

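# Two texts that together exceed the length limit
text1 = "This is a very long text that exceeds the maximum sequence length for the model."
text2 = "Here is another very long text that might cause truncation."

# Encode the over-long sentence pair
encoded_dict = tokenizer.encode_plus(
    text1,
    text2,
    max_length=20,  # assume a maximum length of 20 for illustration
    truncation='longest_first',  # enable truncation with the longest_first strategy
    return_overflowing_tokens=True  # try to get back the truncated part
)

# ⚠️ Depending on the transformers version and on whether a fast or slow
# tokenizer is loaded, this may be an empty list, a missing key, or an error
print(encoded_dict['overflowing_tokens'])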

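- Where the problem lies:

The longest_first truncation strategy: it truncates the longer sequence first and is typically used with two inputs (such as a text pair).
return_overflowing_tokens=True: this asks for the truncated tokens back, but longest_first does not return them for sequence pairs, so overflowing_tokens is an empty list.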

- Solutions

1. Disable `return_overflowing_tokens`

If you do not need the overflowing tokens, simply disable return_overflowing_tokens:

encoded_dict = tokenizer.encode_plus(
    text1,
    text2,
    max_length=20,
    truncation=True,  # truncate to fit the maximum length; True means longest_first
    return_overflowing_tokens=False  # do not return the overflow
)

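2. Use the `only_first` or `only_second` strategy

If you do need the overflowing tokens, set the truncation strategy to only_first or only_second. The strategy is passed as the value of truncation; when truncation=True is given, a separate truncation_strategy argument may be ignored:

encoded_dict = tokenizer.encode_plus(
    text1,
    text2,
    max_length=20,
    truncation='only_first',  # truncate only the first sequence
    return_overflowing_tokens=True  # the overflow can now be returned
)
# Slow (Python) tokenizers expose the overflow under this key; fast tokenizers
# may instead return the extra chunks as additional rows of input_ids
print(encoded_dict.get('overflowing_tokens'))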
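
The reason only_first can return an overflow is that everything removed is one contiguous tail of a single sequence. The sketch below illustrates that idea in plain Python. The name truncate_only_first is hypothetical, and raising ValueError when the first sequence is too short is an assumption made for this example rather than the library's documented behaviour.

```python
def truncate_only_first(ids_a, ids_b, num_to_remove):
    """Illustrative only: cut the tail of the first sequence and return it."""
    if num_to_remove <= 0:
        return list(ids_a), list(ids_b), []
    if num_to_remove >= len(ids_a):
        # Assumption for this sketch: refuse to empty the first sequence.
        raise ValueError("first sequence is too short to absorb the truncation")
    keep = len(ids_a) - num_to_remove
    overflow = list(ids_a[keep:])  # one contiguous tail, easy to hand back
    return list(ids_a[:keep]), list(ids_b), overflow


kept_a, kept_b, overflow = truncate_only_first(list(range(10)), [100, 101], 3)
print(kept_a, kept_b, overflow)
```

The second sequence is left untouched here, which is the usual choice when it holds a question or label that must survive intact.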

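3. Truncate the input sequences manually

If you need more flexible sequence handling, truncating the inputs yourself is safer:

def truncate_pair(text1, text2, max_length):
    tokens1 = tokenizer.tokenize(text1)
    tokens2 = tokenizer.tokenize(text2)
    # Reserve room for the special tokens ([CLS] and two [SEP] for a BERT pair)
    budget = max_length - tokenizer.num_special_tokens_to_add(pair=True)
    while len(tokens1) + len(tokens2) > budget:
        if len(tokens1) > len(tokens2):
            tokens1.pop()
        else:
            tokens2.pop()
    # Convert the tokens to ids first so this works with fast and slow tokenizers
    ids1 = tokenizer.convert_tokens_to_ids(tokens1)
    ids2 = tokenizer.convert_tokens_to_ids(tokens2)
    # return_tensors='pt' requires PyTorch to be installed
    return tokenizer.prepare_for_model(ids1, ids2, return_tensors='pt', prepend_batch_axis=True)

# Call the function
encoded_dict = truncate_pair(text1, text2, max_length=20)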
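
As a quick check of the length arithmetic in the manual approach, the sketch below runs the same loop on lengths only. The helper kept_lengths is hypothetical, and the default of 3 special tokens is an assumption that holds for a BERT-style pair ([CLS], [SEP], [SEP]); other models may differ.

```python
def kept_lengths(len_a, len_b, max_length, num_special=3):
    """Illustrative only: how many tokens each sequence keeps after truncation."""
    budget = max_length - num_special
    if budget < 0:
        raise ValueError("max_length is smaller than the number of special tokens")
    while len_a + len_b > budget:
        if len_a > len_b:
            len_a -= 1
        else:
            len_b -= 1
    return len_a, len_b


# Two sequences of 16 and 11 tokens squeezed into max_length=20
a, b = kept_lengths(16, 11, max_length=20)
print(a, b, a + b + 3)
```

Whatever the split, the kept tokens plus the special tokens never exceed max_length, and inputs that already fit are returned unchanged.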

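4. Use `padding='max_length'` for uniform lengths

Sometimes you also want every encoded sequence to have the same length, which padding='max_length' provides. Padding by itself does not affect this message; in the call below it is return_overflowing_tokens=False that leaves the overflow out:

encoded_dict = tokenizer.encode_plus(
    text1,
    text2,
    max_length=20,
    padding='max_length',
    truncation=True,
    return_overflowing_tokens=False  # do not request the overflow
)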
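
For reference, padding to a fixed length amounts to appending pad ids and marking them in the attention mask. The sketch below shows this in plain Python. The helper pad_to_max_length is hypothetical, and pad_id=0 is an assumption that matches the [PAD] id of bert-base-uncased; other vocabularies may use a different id.

```python
def pad_to_max_length(input_ids, max_length, pad_id=0):
    """Illustrative only: right-pad ids and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": list(input_ids) + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


padded = pad_to_max_length([101, 7592, 102], max_length=6)
print(padded)
```

Padding only ever adds tokens, so it is independent of which tokens truncation removed.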

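- Final recommendations:

If you want the overflowing tokens returned when truncating, set the truncation strategy to only_first or only_second.
If you do not care about the overflow, just set return_overflowing_tokens to False.
When processing multiple texts, choose the truncation strategy deliberately. The default longest_first suits most cases, but it does not support returning overflowing tokens for sequence pairs.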