Section 2.2: BERT and the Transformer Architecture in Large AI Models

Latest recommended article published on 2025-08-30 10:17:18

黑夜开发者

Licensed under CC 4.0 BY-SA

Category column: The Intelligent Era: An AI Course Everyone Should Know🔥 Tags: artificial intelligence bert transformer

Article link: https://blog.csdn.net/qq_21891743/article/details/150936402

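🏆About the author: 黑夜开发者, a CSDN leading figure, top-rated full-stack creator✌, CSDN blog expert, Alibaba Cloud community expert blogger, and top 4 in the CSDN Shanghai track in June 2023.
🏆Several years of experience in the e-commerce industry, having served as core R&D engineer and project technical lead.
🏆This article is included in the PHP column: The Intelligent Era: An AI Course Everyone Should Know
🎉Feel free to 👍like, ✍comment, and ⭐bookmark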

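Hello everyone, and welcome to my latest column, "The Intelligent Era: An AI Course Everyone Should Know". Artificial intelligence is no longer a distant concept from science-fiction films; it is profoundly changing everyone's life. From the explosive popularity of ChatGPT to the spread of autonomous driving, from the convenience of smart homes to breakthroughs in medical AI, AI technology is reshaping our world at remarkable speed. Today's topic is [BERT and the Transformer Architecture].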

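🚀1. Introduction

The Transformer architecture is a revolutionary breakthrough in natural language processing that fundamentally changed how sequence data is handled. BERT (Bidirectional Encoder Representations from Transformers), an important application of the Transformer architecture, appeared in 2018 and brought major advances to NLP. This article explains the core principles of the Transformer architecture and the innovations of the BERT model.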

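🚀2. Transformer Architecture Basics

🔎2.1 What Is the Transformer

The Transformer is a neural network architecture based on the attention mechanism, proposed by Google in the 2017 paper "Attention Is All You Need".

Core features:

Pure attention: built entirely on self-attention, with no recurrence or convolution
Parallel computation: all positions of a sequence can be processed in parallel
Long-range dependencies: dependencies between distant positions are captured effectively
Scalability: easy to scale up to larger models

🔎2.2 Overall Architecture of the Transformer

Basic structure:

Input sequence → Encoder → Decoder → Output sequence

Encoder-decoder architecture:

Encoder: converts the input sequence into hidden representations
Decoder: generates the target sequence based on the encoder output
Attention mechanism: connects the encoder and the decoder

Python implementation example: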

import torch
import torch.nn as nn
import math

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, n_heads=8, 
                 n_layers=6, d_ff=2048, max_seq_length=5000, dropout=0.1):
        super().__init__()
        
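        # Encoder and Decoder are not defined in this article; they are assumed to be
        # stacks of n_layers EncoderLayer / DecoderLayer modules (see Section 3.3)
        self.encoder = Encoder(src_vocab_size, d_model, n_heads, n_layers, d_ff, dropout)
        self.decoder = Decoder(tgt_vocab_size, d_model, n_heads, n_layers, d_ff, dropout)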
        
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
        
        self.dropout = nn.Dropout(dropout)
        self.output_layer = nn.Linear(d_model, tgt_vocab_size)
        
    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # Encoder: embed the source tokens, add positions, then encode
        src_embedded = self.dropout(self.positional_encoding(self.src_embedding(src)))
        encoder_output = self.encoder(src_embedded, src_mask)
        
        # Decoder: embed the target tokens, add positions, then decode
        tgt_embedded = self.dropout(self.positional_encoding(self.tgt_embedding(tgt)))
        decoder_output = self.decoder(tgt_embedded, encoder_output, src_mask, tgt_mask)
        
        # Output layer: project to target-vocabulary logits
        output = self.output_layer(decoder_output)
        
        return output

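🔎2.3 Positional Encoding

Problem: the Transformer has no recurrence, and self-attention by itself ignores token order, so the model cannot capture position information in the sequence without extra input.

Solution: use a positional encoding to add position information at every position.

Formula:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Python implementation: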

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length=5000):
        super().__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        
        # 1 / 10000^(2i/d_model), computed in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                           (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        
        # Shape (1, max_seq_length, d_model): batch-first, matching the
        # (batch, seq_len, d_model) layout that MultiHeadAttention expects
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]
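To make the formula concrete, here is a minimal pure-Python sketch that evaluates it for single positions. The helper name `positional_encoding` and the tiny `d_model=8` are illustrative choices, not part of the original code.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding vector for one position (illustrative helper)."""
    pe = []
    for i in range(0, d_model, 2):          # i plays the role of 2i in the formula
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))          # even dimension
        pe.append(math.cos(angle))          # odd dimension
    return pe[:d_model]

pe0 = positional_encoding(0, 8)
pe1 = positional_encoding(1, 8)
print([round(v, 4) for v in pe0])
print([round(v, 4) for v in pe1])
```

Position 0 alternates between sin(0) and cos(0), and later dimensions change more slowly with `pos` because their wavelength is longer.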

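🚀3. The Attention Mechanism in Detail

🔎3.1 Self-Attention

Core idea:
Every position can attend to all positions in the sequence and learn how they relate to one another.

Computation steps:

Query, Key, Value: compute Q, K, and V from the input
Attention scores: compute the similarity between Q and K
Attention weights: normalize the scores with softmax
Weighted sum: use the weights to take a weighted sum of V

Formula:

Attention(Q, K, V) = softmax(QK^T / √d_k)V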
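The four steps can be traced by hand on a tiny example. The sketch below uses plain Python lists instead of tensors; the helper names and the 2×2 toy matrices are made up for illustration.

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Scores: scaled dot product of this query with every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)                  # weights sum to 1
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(Q, K, V)
print(w)
print(out)
```

Each query puts more weight on the key it matches, so each output row leans toward the corresponding value vector.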

Python implementation:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Attention scores: (..., seq_q, seq_k), scaled by sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        # Apply the mask, if any: positions where mask == 0 get a large negative score
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Attention weights
        attention_weights = torch.softmax(scores, dim=-1)
        
        # Weighted sum of the values
        output = torch.matmul(attention_weights, V)
        
        return output, attention_weights
        
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        
        # Project, then reshape to (batch, n_heads, seq_len, d_k)
        Q = self.w_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.w_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.w_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        
        # Compute attention for all heads at once
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Reshape back to (batch, seq_len, d_model), concatenating the heads
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model)
        
        # Output projection
        output = self.w_o(attention_output)
        
        return output, attention_weights

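🔎3.2 Multi-Head Attention

Core idea:
Run the attention mechanism several times in parallel, each time with a different linear projection, then merge the results.

Advantages:

Parallel computation: the attention heads can be computed in parallel
Different representations: each head can learn a different representation
More expressive power: improves the model's expressiveness

How it works: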

def multi_head_attention_forward(self, query, key, value, mask=None):
    batch_size = query.size(0)
    
    # Compute Q, K, V for every head: (batch, n_heads, seq_len, d_k)
    Q = self.w_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
    K = self.w_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
    V = self.w_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
    
    # Attention per head. The loop is for clarity only: the heads are independent,
    # so a real implementation batches them in a single matmul
    attention_outputs = []
    for head in range(self.n_heads):
        head_output, _ = self.scaled_dot_product_attention(
            Q[:, head], K[:, head], V[:, head], mask)
        attention_outputs.append(head_output)
    
    # Concatenate the outputs of all heads: (batch, seq_len, d_model)
    concat_attention = torch.cat(attention_outputs, dim=-1)
    
    # Output projection
    output = self.w_o(concat_attention)
    
    return output
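The reshaping behind multi-head attention is easier to see on a single vector. This is a minimal sketch with hypothetical helpers `split_heads` and `merge_heads`: a `d_model`-sized vector is cut into `n_heads` slices of size `d_k = d_model // n_heads`, and concatenating the slices restores the original layout.

```python
def split_heads(vector, n_heads):
    """Cut one d_model-sized vector into n_heads slices of size d_k."""
    d_model = len(vector)
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_k = d_model // n_heads
    return [vector[h * d_k:(h + 1) * d_k] for h in range(n_heads)]

def merge_heads(heads):
    """Concatenate the per-head slices back into one vector."""
    return [x for head in heads for x in head]

x = list(range(8))
heads = split_heads(x, n_heads=4)
print(heads)
print(merge_heads(heads) == x)
```

With the defaults used earlier (`d_model=512`, `n_heads=8`), each head works on 64 dimensions, so adding heads changes how the dimensions are partitioned rather than the total width.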

🔎3.3 Encoder and Decoder

Encoder layer:

class EncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super().__init__()
        
        self.self_attention = MultiHeadAttention(d_model, n_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Self-attention + residual connection + layer normalization
        attn_output, _ = self.self_attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward network + residual connection + layer normalization
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        
        return x
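The "add and normalize" step that wraps both sublayers can be sketched without tensors. The version below assumes layer normalization with scale 1 and shift 0 (the initial values of `nn.LayerNorm`) and omits dropout; the helper names are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

y = add_and_norm([1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.5, -0.5])
print([round(v, 4) for v in y])
```

The residual path lets the input pass through unchanged when the sublayer output is small, and the normalization keeps activations on a comparable scale from layer to layer.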

Decoder layer:

class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super().__init__()
        
        self.self_attention = MultiHeadAttention(d_model, n_heads)
        self.cross_attention = MultiHeadAttention(d_model, n_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        # Masked self-attention (tgt_mask hides future positions)
        attn_output, _ = self.self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Cross-attention over the encoder output
        cross_attn_output, _ = self.cross_attention(x, encoder_output, encoder_output, src_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))
        
        # Feed-forward network
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        
        return x
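The decoder's self-attention relies on `tgt_mask` to hide future positions. A minimal sketch of such a causal mask follows, using the same convention as `masked_fill(mask == 0, -1e9)` above, where 1 means "may attend" and 0 means "blocked". The helper name `causal_mask` is an assumption.

```python
def causal_mask(size):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]

mask = causal_mask(4)
for row in mask:
    print(row)
```

Row `i` lists which positions token `i` can see, so the first token sees only itself and the last token sees the whole prefix.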

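🚀4. BERT in Detail

🔎4.1 BERT's Core Innovations

Bidirectional encoding:

Traditional language models: can only see the context before the current position
BERT: sees the context on both sides at the same time

Pre-training tasks:

Masked language model (MLM): predict the masked tokens
Next sentence prediction (NSP): decide whether two sentences are adjacent

Architecture features:

Built on the Transformer encoder
Bidirectional context understanding
Strong feature extraction

🔎4.2 BERT's Model Architecture

Basic structure: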

class BERT(nn.Module):
    def __init__(self, vocab_size, d_model=768, n_layers=12, n_heads=12, 
                 d_ff=3072, max_seq_length=512, dropout=0.1):
        super().__init__()
        
        self.embedding = BERTEmbedding(vocab_size, d_model, max_seq_length, dropout)
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, n_heads, d_ff, dropout) 
            for _ in range(n_layers)
        ])
        
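    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        # Embedding layer
        embedded = self.embedding(input_ids, token_type_ids)
        
        # Reshape the (batch, seq_len) padding mask to (batch, 1, 1, seq_len)
        # so it broadcasts over the (batch, heads, seq_len, seq_len) scores
        if attention_mask is not None:
            attention_mask = attention_mask[:, None, None, :]
        
        # Encoder layers
        hidden_states = embedded
        for encoder_layer in self.encoder_layers:
            hidden_states = encoder_layer(hidden_states, attention_mask)
        
        return hidden_states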
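As a sanity check on these defaults, the parameter count can be worked out by hand. The sketch below assumes a vocabulary of 30,522 tokens (the size used by the English uncased BERT-base checkpoint; the class above leaves `vocab_size` open) and includes a pooler layer like the one used in Section 4.4. The helper names are illustrative.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def bert_encoder_params(vocab_size=30522, d_model=768, n_layers=12,
                        d_ff=3072, max_seq_length=512, type_vocab_size=2):
    layer_norm = 2 * d_model                       # scale and shift
    embeddings = (vocab_size + max_seq_length + type_vocab_size) * d_model + layer_norm
    attention = 4 * linear_params(d_model, d_model)  # W_q, W_k, W_v, W_o
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = linear_params(d_model, d_model)
    return embeddings + n_layers * per_layer + pooler

total = bert_encoder_params()
print(f"{total:,} parameters")
```

Under these assumptions the total is roughly 110 million, in line with the size usually quoted for BERT-base; most of it sits in the 12 encoder layers, followed by the token embedding table.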

Embedding layer:

class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, max_seq_length, dropout):
        super().__init__()
        
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_length, d_model)
        self.token_type_embedding = nn.Embedding(2, d_model)  # sentence A and sentence B
        
        self.layer_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input_ids, token_type_ids=None):
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        
        # Sum the three embeddings
        embeddings = (self.token_embedding(input_ids) + 
                     self.position_embedding(position_ids) + 
                     self.token_type_embedding(token_type_ids))
        
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        
        return embeddings

🔎4.3 BERT's Pre-training Tasks

Masked language model (MLM):

class MLMHead(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        
        self.dense = nn.Linear(d_model, d_model)
        self.activation = nn.GELU()
        self.layer_norm = nn.LayerNorm(d_model)
        self.decoder = nn.Linear(d_model, vocab_size)
        
    def forward(self, hidden_states, masked_positions):
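        # Predict only at the masked positions
        # (masked_positions is assumed to be a boolean mask of shape (batch, seq_len))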
        masked_states = hidden_states[masked_positions]
        
        x = self.dense(masked_states)
        x = self.activation(x)
        x = self.layer_norm(x)
        x = self.decoder(x)
        
        return x

import random

def create_masked_lm_predictions(tokens, tokenizer, vocab_size, mask_prob=0.15):
    """Build the inputs for the MLM task from a 1-D tensor of token ids."""
    masked_tokens = tokens.clone()
    masked_positions = []
    masked_labels = []
    
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            masked_positions.append(i)
            masked_labels.append(token.item())
            
            # Draw once and split the outcome 80% / 10% / 10%
            roll = random.random()
            if roll < 0.8:
                # 80%: replace with [MASK]
                masked_tokens[i] = tokenizer.mask_token_id
            elif roll < 0.9:
                # 10%: replace with a random token
                masked_tokens[i] = random.randint(0, vocab_size - 1)
            # remaining 10%: keep the original token
    
    return masked_tokens, masked_positions, masked_labels
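A quick way to check the 15% and 80/10/10 proportions is to run the masking rule on plain integer ids with a seeded random generator. This is a tensor-free sketch: `mask_tokens`, the id 103 for [MASK], and the vocabulary size 30,522 are assumptions chosen for illustration, and -100 marks positions that carry no label.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=random):
    """Apply BERT-style masking to a list of ids; returns (inputs, labels)."""
    masked = list(token_ids)
    labels = [-100] * len(token_ids)        # -100 = not selected for prediction
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                     # the model must recover the original id
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id             # 80%: [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return masked, labels

rng = random.Random(0)
tokens = [rng.randrange(1000, 2000) for _ in range(10000)]
masked, labels = mask_tokens(tokens, mask_id=103, vocab_size=30522, rng=rng)

selected = sum(1 for l in labels if l != -100)
as_mask = sum(1 for m, l in zip(masked, labels) if l != -100 and m == 103)
print(f"selected: {selected / len(tokens):.3f}, replaced by [MASK]: {as_mask / selected:.3f}")
```

Keeping some selected tokens unchanged or randomized means the model cannot assume that every non-[MASK] token is correct, which narrows the gap between pre-training and fine-tuning, where [MASK] never appears.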

Next sentence prediction (NSP):

class NSPHead(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        
        self.classifier = nn.Linear(d_model, 2)  # binary classification
        
    def forward(self, pooled_output):
        return self.classifier(pooled_output)

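import random

def create_next_sentence_prediction(sentence_a, sentence_b, tokenizer,
                                    corpus_sentences, is_next=True):
    """Build the inputs for the NSP task."""
    # add_special_tokens=False because [CLS] and [SEP] are added by hand below
    # (assumes a Hugging Face-style tokenizer)
    tokens_a = tokenizer.encode(sentence_a, add_special_tokens=False)
    if is_next:
        # The real next sentence
        tokens_b = tokenizer.encode(sentence_b, add_special_tokens=False)
        label = 1
    else:
        # A randomly chosen sentence from the corpus
        tokens_b = tokenizer.encode(random.choice(corpus_sentences),
                                    add_special_tokens=False)
        label = 0
    
    # Add the special tokens: [CLS] A [SEP] B [SEP]
    tokens = [tokenizer.cls_token_id] + tokens_a + [tokenizer.sep_token_id] + tokens_b + [tokenizer.sep_token_id]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    
    return tokens, token_type_ids, label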
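The layout of a sentence pair is easiest to verify on toy ids. In the sketch below, `build_pair` is a hypothetical helper and the ids 101 and 102 for [CLS] and [SEP] are assumptions; it reproduces only the concatenation step, without a tokenizer.

```python
def build_pair(tokens_a, tokens_b, cls_id=101, sep_id=102):
    """Lay out [CLS] A [SEP] B [SEP] and the matching segment ids."""
    input_ids = [cls_id] + tokens_a + [sep_id] + tokens_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return input_ids, token_type_ids

ids, types = build_pair([7, 8, 9], [20, 21])
print(ids)
print(types)
```

The two lists always have the same length, which is what `BERTEmbedding` needs in order to add the token and segment embeddings position by position.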

🔎4.4 BERT's Training Process

Pre-training procedure:

import torch.nn.functional as F

def pretrain_bert(model, dataloader, optimizer, device):
    # Assumes the model also exposes mlm_head, pooler and nsp_head modules,
    # which the BERT class above does not define
    model.train()
    total_loss = 0.0
    
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        mlm_labels = batch['mlm_labels'].to(device)
        nsp_labels = batch['nsp_labels'].to(device)
        
        # Forward pass
        hidden_states = model(input_ids, token_type_ids, attention_mask)
        
        # MLM loss
        mlm_logits = model.mlm_head(hidden_states, batch['masked_positions'])
        mlm_loss = F.cross_entropy(mlm_logits, mlm_labels)
        
        # NSP loss
        pooled_output = model.pooler(hidden_states[:, 0])  # [CLS] token
        nsp_logits = model.nsp_head(pooled_output)
        nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
        
        # Combined loss for this batch (kept separate from the running total)
        loss = mlm_loss + nsp_loss
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(dataloader)

🚀5. Applying and Fine-Tuning BERT

🔎5.1 Text Classification

Sentiment analysis:

class BERTClassifier(nn.Module):
    def __init__(self, bert_model, num_classes, dropout=0.1):
        super().__init__()
        
        self.bert = bert_model
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(768, num_classes)  # BERT-base hidden size is 768
        
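    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        # Get the BERT outputs; keyword arguments avoid mixing up the argument order.
        # bert_model is assumed to return (sequence_output, pooled_output), as
        # Hugging Face's BertModel does; the BERT class above returns only hidden states
        outputs = self.bert(input_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask)
        pooled_output = outputs[1]  # pooled [CLS] representation
        
        # Classify
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        
        return logits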

def train_classifier(model, train_dataloader, val_dataloader, num_epochs=3):
    device = next(model.parameters()).device  # use whatever device the model is on
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    criterion = nn.CrossEntropyLoss()
    
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        
        for batch in train_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            optimizer.zero_grad()
            logits = model(input_ids, attention_mask=attention_mask)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        # Validation (evaluate_model is a helper assumed to be defined elsewhere)
        model.eval()
        val_accuracy = evaluate_model(model, val_dataloader)
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_dataloader):.4f}, Val Accuracy: {val_accuracy:.4f}")

🔎5.2 Named Entity Recognition (NER)

Sequence labeling:

class BERTNER(nn.Module):
    def __init__(self, bert_model, num_labels, dropout=0.1):
        super().__init__()
        
        self.bert = bert_model
        self.dropout = nn.Dropout(dropout)
        self.ner_layer = nn.Linear(768, num_labels)
        
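    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        # Get the BERT outputs (bert_model is assumed to return a tuple-like
        # (sequence_output, pooled_output), as Hugging Face's BertModel does)
        outputs = self.bert(input_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask)
        sequence_output = outputs[0]  # outputs for every token
        
        # Per-token NER classification
        sequence_output = self.dropout(sequence_output)
        logits = self.ner_layer(sequence_output)
        
        return logits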

def train_ner_model(model, train_dataloader, val_dataloader, num_epochs=3):
    device = next(model.parameters()).device  # use whatever device the model is on
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
    criterion = nn.CrossEntropyLoss(ignore_index=-100)  # skip positions labeled -100 (e.g. padding)
    
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        
        for batch in train_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            optimizer.zero_grad()
            logits = model(input_ids, attention_mask=attention_mask)
            
            # Flatten logits and labels, keeping only the non-padding positions
            active_loss = attention_mask.view(-1) == 1
            active_logits = logits.view(-1, logits.size(-1))[active_loss]
            active_labels = labels.view(-1)[active_loss]
            
            loss = criterion(active_logits, active_labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_dataloader):.4f}")
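What `ignore_index=-100` does to the loss can be shown with a small pure-Python version of token-level cross-entropy. This is a sketch of the idea only, not PyTorch's implementation; `token_cross_entropy` and the toy logits are made up for illustration.

```python
import math

def token_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over tokens whose label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue                         # padding or sub-word pieces: no loss
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)         # assumes at least one labeled token

logits = [[0.0, 0.0], [5.0, -5.0], [2.0, 1.0]]
labels = [0, -100, -100]
loss = token_cross_entropy(logits, labels)
print(round(loss, 4))
```

Only the first token contributes here, so the ignored rows can hold any values without changing the result. That is why padded positions do not distort the average.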

🔎5.3 Question Answering

Reading comprehension:

class BERTQA(nn.Module):
    def __init__(self, bert_model, dropout=0.1):
        super().__init__()
        
        self.bert = bert_model
        self.dropout = nn.Dropout(dropout)
        self.qa_outputs = nn.Linear(768, 2)  # start and end positions
        
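    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        # Get the BERT outputs (bert_model is assumed to return a tuple-like
        # (sequence_output, pooled_output), as Hugging Face's BertModel does)
        outputs = self.bert(input_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask)
        sequence_output = outputs[0]
        
        # QA outputs: one start logit and one end logit per token
        sequence_output = self.dropout(sequence_output)
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        
        return start_logits.squeeze(-1), end_logits.squeeze(-1)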

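def extract_answer(start_logits, end_logits, input_ids, tokenizer, max_answer_length=30):
    """Extract the answer from the logits of a single example (1-D tensors)."""
    # argmax over the logits equals argmax over the softmax probabilities
    start_idx = torch.argmax(start_logits).item()
    
    # Search for the end only inside [start_idx, start_idx + max_answer_length),
    # so the span is never reversed, never overlong, and stays inside the sequence
    window_end = min(start_idx + max_answer_length, end_logits.size(0))
    end_idx = start_idx + torch.argmax(end_logits[start_idx:window_end]).item()
    
    # Extract the answer tokens
    answer_tokens = input_ids[start_idx:end_idx + 1]
    answer = tokenizer.decode(answer_tokens)
    
    return answer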
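Picking the start and end positions independently can yield an end that comes before the start. A common alternative is to score every valid span jointly and keep the best one. The sketch below does this with plain lists; `best_span` and the toy logits are illustrative, and the score is the sum of the start and end logits.

```python
def best_span(start_logits, end_logits, max_answer_length=30):
    """Return (start, end) maximizing start_logit + end_logit with start <= end."""
    best, best_score = (0, 0), float("-inf")
    n = len(start_logits)
    for s in range(n):
        # Only consider ends inside the allowed window
        for e in range(s, min(s + max_answer_length, n)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

start = [0.1, 2.0, 0.3, 0.2, 2.5]
end = [0.0, 0.1, 3.0, 0.5, 0.2]
print(best_span(start, end))
```

In this toy case the independent argmax would choose start 4 and end 2, which is not a valid span, while the joint search returns a start that precedes its end. The double loop costs O(n × max_answer_length), which is small for typical answer lengths.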

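🚀6. Strengths and Challenges of the Transformer

🔎6.1 Technical Strengths

Parallel computation:

RNN/LSTM: tokens must be processed one after another, so time steps cannot run in parallel
Transformer: the whole sequence can be processed in parallel

Long-range dependencies:

RNN/LSTM: long-range dependencies are easily lost
Transformer: attention directly connects any two positions

Scalability:

Easy to scale up to larger models
Supports different tasks and domains

Strong expressiveness:

Multi-head attention provides rich representations
Can learn complex patterns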

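🔎6.2 Technical Challenges

Computational complexity:

Attention costs O(n²) in the sequence length n
Long sequences are expensive to process

Memory consumption:

The attention matrix has to be stored
Large models need a lot of GPU memory

Training stability:

The learning rate needs careful tuning
Vanishing/exploding gradients

Position encoding:

Absolute position encodings may not be flexible enough
Relative position encodings are still being researched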
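The quadratic cost is easy to quantify with back-of-the-envelope arithmetic. The sketch below estimates the memory needed just for the attention weights of one layer, assuming 12 heads, batch size 1 and 4 bytes per value (fp32). These settings and the helper name are assumptions for illustration.

```python
def attention_matrix_bytes(seq_len, n_heads=12, batch_size=1, bytes_per_value=4):
    """Memory for one layer's attention weights: one n x n matrix per head."""
    return batch_size * n_heads * seq_len * seq_len * bytes_per_value

for n in (512, 2048, 8192):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"seq_len={n:5d}: {mib:8.1f} MiB per layer")
```

Under these assumptions the matrix takes 12 MiB at 512 tokens and 3 GiB at 8,192 tokens: a 16 times longer sequence needs 256 times the memory, before counting other activations or the remaining layers.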

🔎6.3 Optimization Strategies

Compute optimization:

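# Sparse attention (placeholder: the computation is not implemented here)
class SparseAttention(nn.Module):
    def __init__(self, d_model, n_heads, sparsity_factor=4):
        super().__init__()
        self.sparsity_factor = sparsity_factor
        # The sparse attention mechanism would be set up here
        
    def forward(self, Q, K, V, mask=None):
        # Compute only a subset of the attention scores
        # to reduce the computational cost
        raise NotImplementedError

# Linear attention (placeholder: the computation is not implemented here)
class LinearAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        # An attention variant with linear complexity would be set up here
        
    def forward(self, Q, K, V):
        # Attention computed in time linear in the sequence length
        raise NotImplementedError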
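One common sparsity pattern is a sliding window, where each position attends only to neighbors within a fixed distance. The sketch below builds such a mask and counts how many query-key pairs remain. It illustrates the pattern only: the helper names and the window size are assumptions, and this is not what the `SparseAttention` placeholder above implements.

```python
def sliding_window_mask(seq_len, window):
    """1 where |i - j| <= window, else 0 (same convention as the masks above)."""
    return [[1 if abs(i - j) <= window else 0 for j in range(seq_len)]
            for i in range(seq_len)]

def count_scored_pairs(mask):
    """Number of query-key pairs that still need an attention score."""
    return sum(sum(row) for row in mask)

mask = sliding_window_mask(seq_len=8, window=1)
kept = count_scored_pairs(mask)
print(f"{kept} of {8 * 8} pairs kept")
```

For 8 tokens and a window of 1, only 22 of the 64 pairs remain, and the count grows linearly with the sequence length rather than quadratically. Applying the mask to a full score matrix saves nothing by itself; the gain comes from implementations that never compute the masked entries.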

Memory optimization:

# Gradient checkpointing: recompute activations in the backward pass to save memory
from torch.utils.checkpoint import checkpoint

def forward_with_checkpoint(self, x):
    return checkpoint(self.transformer_block, x)

# Mixed-precision training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

with autocast():
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

🚀7. Practical Application Examples

🔎7.1 Text Classification

Sentiment analysis system:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

class SentimentAnalyzer:
    def __init__(self, model_name='bert-base-chinese'):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        # The 3-class head on top of bert-base-chinese is newly initialized, so it
        # must be fine-tuned on labeled data before the predictions mean anything
        self.model = BertForSequenceClassification.from_pretrained(
            model_name, num_labels=3)  # positive, negative, neutral
        
    def analyze_sentiment(self, text):
        # Preprocess
        inputs = self.tokenizer(text, return_tensors='pt', truncation=True, 
                               max_length=512, padding=True)
        
        # Predict
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.softmax(outputs.logits, dim=-1)
            predicted_class = torch.argmax(predictions, dim=-1)
        
        sentiment_map = {0: 'negative', 1: 'neutral', 2: 'positive'}
        return sentiment_map[predicted_class.item()], predictions[0].tolist()

# Usage example
analyzer = SentimentAnalyzer()
sentiment, scores = analyzer.analyze_sentiment("这部电影真的很棒！")  # "This movie is really great!"
print(f"Sentiment: {sentiment}, confidence: {scores}")

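🔎7.2 Text Summarization System

Extractive summarization:

import torch
from transformers import BertModel, BertTokenizer

class TextSummarizer: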
    def __init__(self, model_name='bert-base-chinese'):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name)
        
    def extract_summary(self, text, max_length=100):
        # Split into sentences on the Chinese full stop
        sentences = text.split('。')
        
        # Score the importance of each sentence
        sentence_scores = []
        for sentence in sentences:
            if len(sentence.strip()) < 10:  # skip very short sentences
                continue
                
            inputs = self.tokenizer(sentence, return_tensors='pt', 
                                   truncation=True, max_length=512)
            
            with torch.no_grad():
                outputs = self.model(**inputs)
                # Use the [CLS] token output as the sentence representation
                sentence_embedding = outputs.last_hidden_state[:, 0, :]
                # Importance score (a simple L2 norm, used here as a rough heuristic)
                score = torch.norm(sentence_embedding).item()
                sentence_scores.append((sentence, score))
        
        # Sort by score and pick the top sentences
        sentence_scores.sort(key=lambda x: x[1], reverse=True)
        
        summary = ""
        current_length = 0
        for sentence, score in sentence_scores:
            if current_length + len(sentence) <= max_length:
                summary += sentence + "。"
                current_length += len(sentence)
            else:
                break
        
        return summary

# Usage example (Chinese input, since the model is bert-base-chinese)
summarizer = TextSummarizer()
text = "人工智能是计算机科学的一个分支，它企图了解智能的实质，并生产出一种新的能以人类智能相似的方式做出反应的智能机器。该领域的研究包括机器人、语言识别、图像识别、自然语言处理和专家系统等。人工智能从诞生以来，理论和技术日益成熟，应用领域也不断扩大，可以设想，未来人工智能带来的科技产品，将会是人类智慧的'容器'。"
summary = summarizer.extract_summary(text)
print(f"Summary: {summary}")

🔎7.3 Multilingual Translation

Cross-lingual understanding:

class MultilingualBERT:
    def __init__(self, model_name='bert-base-multilingual-cased'):
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name)
        
    def get_sentence_embedding(self, text, language='zh'):
        inputs = self.tokenizer(text, return_tensors='pt', truncation=True, 
                               max_length=512, padding=True)
        
        with torch.no_grad():
            outputs = self.model(**inputs)
            # Use the [CLS] token output as the sentence embedding
            embedding = outputs.last_hidden_state[:, 0, :]
        
        return embedding
    
    def compute_similarity(self, text1, text2):
        """Compute the semantic similarity of two texts."""
        emb1 = self.get_sentence_embedding(text1)
        emb2 = self.get_sentence_embedding(text2)
        
        # Cosine similarity between the two sentence embeddings
        similarity = torch.cosine_similarity(emb1, emb2)
        return similarity.item()

# Usage example
multilingual_bert = MultilingualBERT()

# Cross-lingual similarity
chinese_text = "人工智能正在改变世界"  # Chinese for the English sentence below
english_text = "Artificial intelligence is changing the world"
similarity = multilingual_bert.compute_similarity(chinese_text, english_text)
print(f"Similarity: {similarity:.4f}")
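The similarity score returned above is a plain cosine between two embedding vectors. A minimal sketch of that computation on toy vectors, without any model, follows; the helper name and the example vectors are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The score depends only on the angle between the vectors, not on their lengths, so scaling an embedding leaves it unchanged. Raw [CLS] embeddings from a model that was not trained for sentence similarity may all point in similar directions, so the absolute value should be read with caution.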