```python
#%%
# Import the required libraries
import pandas as pd
import numpy as np
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split
import jieba.analyse
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.metrics import classification_report
import snownlp
from wordcloud import WordCloud
import re
from transformers import BertModel, BertTokenizer

# Load the preprocessed data
# clean_df = pd.read_csv('cleaned_mooncake_comments.csv')  # assumes preprocessing is already done

# Inspect the data structure
print(clean_df.head())
print(clean_df.info())

# Sentiment analysis with BERT
# Load a pretrained BERT model and tokenizer
# model = BertModel.from_pretrained('bert-base-uncased')
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-chinese',
    num_labels=3  # three classes: positive, neutral, negative
)

# Prepare the training and test data
# NOTE: the labels fed to the model must be integers in {0, 1, 2}; map the raw
# '评分' ratings to those classes before training.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    clean_df['内容'], clean_df['评分'], test_size=0.2, random_state=42
)

# Convert the texts into BERT input format
def encode_texts(texts):
    return tokenizer(
        texts.tolist(),
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors='pt'
    )

train_encodings = encode_texts(train_texts)
test_encodings = encode_texts(test_texts)

# Define the training and evaluation functions
def train_model(model, train_encodings, train_labels, epochs=3, batch_size=16):
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    for epoch in range(epochs):
        for i in range(0, len(train_encodings['input_ids']), batch_size):
            batch = {k: v[i:i+batch_size] for k, v in train_encodings.items()}
            outputs = model(**batch, labels=torch.tensor(train_labels[i:i+batch_size]))
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if i % 100 == 0:
                print(f'Epoch {epoch+1}, Batch {i // batch_size}, Loss: {loss.item():.4f}')
```
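The cell above breaks off inside `train_model` and never reaches the evaluation half that its last comment promises. Purely as an illustration, a minimal evaluation helper in the same style could look like the sketch below; it reuses the `classification_report` import from the top of the cell and assumes the test labels have already been mapped to the integer classes 0–2.

```python
# Hypothetical evaluation helper completing the cell above (not part of the original code).
# Assumes test_labels are already integer-encoded as 0 (negative), 1 (neutral), 2 (positive).
def evaluate_model(model, test_encodings, test_labels, batch_size=16):
    model.eval()
    all_preds = []
    with torch.no_grad():
        for i in range(0, len(test_encodings['input_ids']), batch_size):
            batch = {k: v[i:i+batch_size] for k, v in test_encodings.items()}
            logits = model(**batch).logits
            all_preds.extend(logits.argmax(dim=-1).tolist())
    print(classification_report(list(test_labels), all_preds, digits=4))
    return all_preds
```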

<think>Okay, I need to help the user with using a BERT model for Chinese sentiment analysis. They want the complete workflow — data preprocessing, model loading, training, and evaluation — with code examples built on the pandas, jieba, torch, and transformers libraries.

First, pin down the requirements. The user probably has some programming background but is not familiar with the concrete steps of fine-tuning BERT, so the explanation should be detailed without becoming overwhelming, and the code should be clear with comments on the key parts.

Next, data preprocessing. Chinese sentiment analysis usually starts from text such as review comments, most likely in CSV format. Use pandas to read the data and handle missing values, then split the dataset. The user mentioned jieba, but BERT normally runs its own tokenizer on raw text, so I should explain why jieba segmentation is unnecessary rather than showing how to combine the two. Reference [2] notes that segmentation style depends on the corpus, but since BERT ships its own tokenizer, the answer should pass raw text to BertTokenizer directly and say so explicitly to avoid confusion.

Then model loading: use BertForSequenceClassification from the transformers library, load a pretrained checkpoint such as 'bert-base-chinese', and set the number of labels (2 for a binary task).

For training, prepare a PyTorch DataLoader, set up the optimizer and loss function, and write a custom Dataset class that handles tokenization and padding. Remember to choose the device (GPU or CPU) and walk through the training loop (forward pass, backward pass, optimizer step).

For evaluation, compute accuracy, F1, and similar metrics on a validation or test set.

The code examples must be correct: import the right modules, split the data properly, and set sensible parameters such as MAX_LEN and BATCH_SIZE. Saving and loading the model is also worth mentioning.

Likely follow-up problems: class imbalance and overfitting (data augmentation, early stopping, learning-rate tuning) and how to truncate or chunk long texts.

Also, reference [2] mentions a set_cws_style function for choosing a segmentation style, which does not apply to BERT; the user may be mixing up the tokenization schemes of different models, so the answer should state that jieba is not needed here and BertTokenizer is used instead.

Putting it together, the steps are:
1. Data preparation and preprocessing: load the data, handle missing values, split the dataset.
2. Encode the text with BertTokenizer and build a DataLoader.
3. Load the pretrained BERT model with a new classification head.
4. Set the training parameters and train the model, including the training loop and validation.
5. Evaluate the model and run predictions.

The code should show reading data with pandas, encoding text with the transformers tokenizer, building the Dataset and DataLoader, defining the model, the training loop, and computing evaluation metrics. Details to watch: padding and attention masks (a DataCollator could do dynamic padding, but handling it in the Dataset keeps things simple) and the tokenizer arguments truncation=True, padding, and max_length.

Finally, the related questions should anticipate things like multi-class classification, hyperparameter tuning, and deployment.</think>

Here is a complete walkthrough of Chinese sentiment analysis with a BERT model, implemented with pandas, PyTorch, and the transformers library:

---

### 1. Data preprocessing

#### 1.1 Data loading and cleaning
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the CSV data (assumes columns named 'text' and 'label')
df = pd.read_csv('sentiment_data.csv')
df = df.dropna()  # drop rows with missing values

# Split into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
```

#### 1.2 Text encoding
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
MAX_LEN = 128  # adjust to the typical text length

def encode_text(text):
    return tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=MAX_LEN,
        padding='max_length',
        truncation=True,
        return_attention_mask=True
    )

# Example encoding
sample = "这个电影太精彩了!"
encoded = encode_text(sample)
print(encoded['input_ids'], encoded['attention_mask'])
```

---

### 2. Model loading and training

#### 2.1 Build a PyTorch Dataset
```python
import torch
from torch.utils.data import Dataset, DataLoader

class SentimentDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = [encode_text(text) for text in texts]
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.texts[idx]['input_ids']),
            'attention_mask': torch.tensor(self.texts[idx]['attention_mask']),
            'label': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# Build the DataLoader
BATCH_SIZE = 32
train_dataset = SentimentDataset(train_df['text'].tolist(), train_df['label'].tolist())
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
```

#### 2.2 Load the BERT model
```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-chinese',
    num_labels=2  # binary classification
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
```

#### 2.3 Training loop
```python
from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)
EPOCHS = 3

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        inputs = {k: v.to(device) for k, v in batch.items() if k != 'label'}
        outputs = model(**inputs, labels=batch['label'].to(device))
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}")
```

---

### 3. Evaluation and prediction

#### 3.1 Validation-set evaluation
```python
from sklearn.metrics import accuracy_score

val_dataset = SentimentDataset(val_df['text'].tolist(), val_df['label'].tolist())
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

model.eval()
predictions, true_labels = [], []
with torch.no_grad():
    for batch in val_loader:
        inputs = {k: v.to(device) for k, v in batch.items() if k != 'label'}
        outputs = model(**inputs)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1).cpu().numpy()
        predictions.extend(preds)
        true_labels.extend(batch['label'].numpy())

print(f"Validation Accuracy: {accuracy_score(true_labels, predictions):.4f}")
```

#### 3.2 Predicting new data
```python
def predict_sentiment(text):
    encoded = encode_text(text)
    input_ids = torch.tensor([encoded['input_ids']]).to(device)
    attention_mask = torch.tensor([encoded['attention_mask']]).to(device)
    with torch.no_grad():
        output = model(input_ids, attention_mask=attention_mask)
    return torch.argmax(output.logits).item()

# Example prediction
text = "服务态度非常差"
print("Positive" if predict_sentiment(text) == 1 else "Negative")
```

---

### 4. Key notes

1. **Tokenization**: BERT uses its own `BertTokenizer`, so no additional jieba segmentation is required [^2]
2. **Hardware acceleration**: the `device` setting automatically selects the GPU for training when one is available
3. **Hyperparameters**: `MAX_LEN`, `BATCH_SIZE`, and `EPOCHS` can be adjusted to the task
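The walkthrough trains and evaluates the classifier but stops short of persisting it, even though the planning notes above mention saving and loading the model. A minimal sketch, assuming the `model`, `tokenizer`, and `device` objects from the sections above and an illustrative output directory `./bert-sentiment`:

```python
# Save the fine-tuned weights and tokenizer (the directory name is illustrative)
model.save_pretrained('./bert-sentiment')
tokenizer.save_pretrained('./bert-sentiment')

# Later, reload them for inference without retraining
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained('./bert-sentiment')
tokenizer = BertTokenizer.from_pretrained('./bert-sentiment')
model.to(device).eval()
```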

相关推荐

import pandas as pd import torch from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder from torch.utils.data import Dataset, DataLoader from torch.utils.tensorboard import SummaryWriter from transformers import BertTokenizer, BertForSequenceClassification, AdamW # 训练模型 device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # 加载数据 data = pd.read_csv('simplifyweibo_5_moods.csv') # 获取text和label texts = data['text'].tolist() labels = data['label'].tolist() # 将本文标签转换为数值标签 label_encoder = LabelEncoder() labels = label_encoder.fit_transform(labels) # 划分训练集和测试集 train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42) # 加载BERT的分词器 tokenizer = BertTokenizer.from_pretrained('./bert_localpath/') # 对文本进行编码 train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128) val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=128) # 创建PyTorch数据集 class WeiboDataset(Dataset): def __init__(self, encodings, labels): self.encodings = encodings self.labels = labels def __getitem__(self, idx): item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()} item['labels'] = torch.tensor(self.labels[idx]) return item def __len__(self): return len(self.labels) train_dataset = WeiboDataset(train_encodings, train_labels) val_dataset = WeiboDataset(val_encodings, val_labels) # 加载BERT模型,设置输出维度为类别数 num_classes = len(label_encoder.classes_) model = BertForSequenceClassification.from_pretrained('./bert_localpath', num_labels=num_classes).to(device) # 创建DataLoader train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True) val_dataloader = DataLoader(val_dataset, batch_size=16, shuffle=True) # 定义优化器 optimizer = AdamW(model.parameters(), lr=2e-5) # 创建TensorBoard的SummmaryWriter writer = SummaryWriter('./logs') epochs = 3 for epoch in r

import pandas as pd import numpy as np import torch from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments from sklearn.model_selection import train_test_split from torch.utils.data import Dataset, DataLoader # 自定义数据集类 class CommentDataset(Dataset): def __init__(self, texts, labels, weights): self.texts = texts self.labels = labels self.weights = weights self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') self.max_len = 128 def __len__(self): return len(self.texts) def __getitem__(self, idx): text = str(self.texts[idx]) encoding = self.tokenizer.encode_plus( text, add_special_tokens=True, max_length=self.max_len, padding='max_length', truncation=True, return_attention_mask=True, return_tensors='pt', ) return { 'input_ids': encoding['input_ids'].flatten(), 'attention_mask': encoding['attention_mask'].flatten(), 'labels': torch.tensor(self.labels[idx], dtype=torch.long), 'weights': torch.tensor(self.weights[idx], dtype=torch.float) } # 数据预处理函数 def preprocess_data(file_path): # 读取数据 df = pd.read_csv(file_path) # 清洗数据:去除无评分的记录 df = df.dropna(subset=['RATING']) # 创建标签映射 df['label'] = df['RATING'].apply(lambda x: 2 if x >=4 else 1 if x ==3 else 0) # 创建样本权重(votes + 1) df['weight'] = df['VOTES'] + 1 # 划分训练集和测试集 train_df, test_df = train_test_split( df, test_size=0.2, stratify=df['label'], random_state=42 ) return train_df, test_df # 自定义训练器(支持样本权重) class WeightedTrainer(Trainer): def compute_loss(self, model, inputs, return_outputs=False): labels = inputs.get("labels") weights = inputs.get("weights") outputs = model(**inputs) logits = outputs.get('logits') loss_fct = torch.nn.CrossEntropyLoss(reduction='none') loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1)) weighted_loss = torch.mean(loss * weights) return (weighted_loss, outputs) if return_outputs else weighted_loss # 主程序 def main(): # 数据预处理 train_df, test_df = preprocess_data('comments.csv') # 创建数据集 train_dataset = CommentDataset( texts=train_df['CONTENT'].values, labels=train_df['label'].values, weights=train_df['weight'].values ) test_dataset = CommentDataset( texts=test_df['CONTENT'].values, labels=test_df['label'].values, weights=test_df['weight'].values ) # 初始化模型 model = BertForSequenceClassification.from_pretrained( 'bert-base-uncased', num_labels=3, problem_type="single_label_classification" ) # 训练参数配置 training_args = TrainingArguments( output_dir='./results', num_train_epochs=3, per_device_train_batch_size=16, per_device_eval_batch_size=16, warmup_steps=500, weight_decay=0.01, logging_dir='./logs', fp16=True, # 启用混合精度训练 evaluation_strategy='epoch', save_strategy='epoch', load_best_model_at_end=True ) # 初始化训练器 trainer = WeightedTrainer( model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset, compute_metrics=lambda p: { 'accuracy': (np.argmax(p.predictions, axis=1) == p.label_ids).mean() } ) # 训练模型 trainer.train() # 保存最终模型 model.save_pretrained('./sentiment_model') trainer.tokenizer.save_pretrained('./sentiment_model') if __name__ == '__main__': main() 请给出基于此代码改良后的完整代码

from transformers import AutoTokenizer,AutoModelForSequenceClassification,Trainer,TrainingArguments,pipeline from datasets import load_dataset import pandas as pd import numpy as np import seaborn as sn import matplotlib.pyplot as plt plt.switch_backend('agg') from sklearn.metrics import roc_auc_score,f1_score,confusion_matrix from sklearn.model_selection import train_test_split import torch from pprint import pprint #读取文件 df = pd.read_csv('test.csv',encoding='gbk') #定义转换字典 target_map={ 'positive':1, 'negative':-1, 'neutral':0 } #将文件中的对应单词转换为数字 单独列出一列 df['target'] = df['sentiment'].map(target_map) #将文本和标签提取出来 这里导出为新的csv文件 方便后续load_data df2 = df[['text','target']] df2.columns=['sentence','label'] df2.to_csv('data.csv',index=None) raw_datasets=load_dataset('csv',data_files='data.csv') #加载训练的数据格式 split =raw_datasets['train'].train_test_split(test_size=0.3, seed=42) #分隔为数据集和测试集 测试集占百分之30 # tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese') bert_model_path = "bert-base-chinese" tokenizer = AutoTokenizer.from_pretrained(bert_model_path) model = AutoModelForSequenceClassification.from_pretrained(bert_model_path, num_labels=3) def tokenize_fn(batch): return tokenizer(batch['sentence'],truncation=True) #将数据集中的句子使用标记器进行标记化处理,以便后续在自然语言处理任务中使用 tokenized_datasets = split.map(tokenize_fn,batched=True) model = AutoModelForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=3) params_before = [] # 遍历模型中的所有参数,并将其初始值添加到params_before列表中,以便后续比较模型参数的变化。 for name, p in model.named_parameters(): params_before.append(p.detach().cpu().numpy()) training_args = TrainingArguments( output_dir='training_dir', #输出文件夹 evaluation_strategy='epoch', save_strategy='epoch', num_train_epochs=3, #训练轮次 # 将每批训练量减半 # per_device_train_batch_size=16, # per_device_eval_batch_size=64, per_device_train_batch_size=8, per_device_eval_batch_size=32 ) def compute_metrics(lo

import torch import pandas as pd import numpy as np from transformers import BertTokenizer, BertModel from torch.optim import Adam from tqdm import tqdm from torch import nn from torch.utils.data import Dataset, DataLoader from sklearn.model_selection import train_test_split import matplotlib.pyplot as plt # 设置随机种子保证可复现性 SEED = 42 torch.manual_seed(SEED) np.random.seed(SEED) torch.backends.cudnn.deterministic = True # 检查GPU是否可用 DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu") print(f"使用设备: {DEVICE}") # 1. 数据准备 def load_and_prepare_data(train_path, test_path): # 加载数据集 train_df = pd.read_csv(train_path) test_df = pd.read_csv(test_path) # 确保数据有需要的列 assert {'url', 'website_text', 'category'}.issubset(train_df.columns) # 创建类别标签映射 categories = train_df['category'].unique() label_map = {cat: idx for idx, cat in enumerate(sorted(categories))} num_classes = len(label_map) print(f"数据集详情:") print(f"- 训练样本数: {len(train_df)}") print(f"- 测试样本数: {len(test_df)}") print(f"- 类别数量: {num_classes}") print(f"- 类别映射: {label_map}") # 分割验证集 train_df, val_df = train_test_split( train_df, test_size=0.1, stratify=train_df['category'], # 按类别分层抽样 random_state=SEED ) return train_df, val_df, test_df, label_map, num_classes # 2. 数据处理类 class WebsiteDataset(Dataset): def __init__(self, df, tokenizer, label_map, max_length=512): """ 参数: df: 包含网站数据的DataFrame tokenizer: BERT分词器 label_map: 类别到数字的映射字典 max_length: 文本最大长度 """ self.texts = df['website_text'].tolist() self.labels = [label_map[label] for label in df['category']] self.tokenizer = tokenizer self.max_length = max_length def __len__(self): return len(self.labels) def __getitem__(self, idx): text = str(self.texts[idx]) label = self.labels[idx] # 分词和编码 encoding = self.tokenizer.encode_plus( text, add_special_tokens=True, max_length=self.max_length, padding='max_length', truncation=True, return_attention_mask=True, return_tensors='pt' ) return { 'input_ids': encoding['input_ids'].flatten(), 'attention_mask': encoding['attention_mask'].flatten(), 'label': torch.tensor(label, dtype=torch.long) } # 3. 模型架构 class WebsiteClassifier(nn.Module): def __init__(self, bert_model, num_classes, dropout=0.3): super().__init__() self.bert = bert_model self.dropout = nn.Dropout(dropout) self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes) def forward(self, input_ids, attention_mask): outputs = self.bert( input_ids=input_ids, attention_mask=attention_mask ) pooled_output = outputs.pooler_output pooled_output = self.dropout(pooled_output) return self.classifier(pooled_output) # 4. 
训练辅助函数 def create_dataloaders(train_df, val_df, test_df, tokenizer, label_map, batch_size=16): train_dataset = WebsiteDataset(train_df, tokenizer, label_map) val_dataset = WebsiteDataset(val_df, tokenizer, label_map) test_dataset = WebsiteDataset(test_df, tokenizer, label_map) train_loader = DataLoader( train_dataset, batch_size=batch_size, shuffle=True ) val_loader = DataLoader( val_dataset, batch_size=batch_size ) test_loader = DataLoader( test_dataset, batch_size=batch_size ) return train_loader, val_loader, test_loader def train_epoch(model, dataloader, criterion, optimizer, device): model.train() total_loss, total_acc = 0, 0 for batch in tqdm(dataloader, desc="训练批次"): input_ids = batch['input_ids'].to(device) attention_mask = batch['attention_mask'].to(device) labels = batch['label'].to(device) optimizer.zero_grad() outputs = model(input_ids, attention_mask) loss = criterion(outputs, labels) acc = (outputs.argmax(dim=1) == labels).sum().item() loss.backward() optimizer.step() total_loss += loss.item() total_acc += acc return total_loss / len(dataloader.dataset), total_acc / len(dataloader.dataset) def evaluate(model, dataloader, criterion, device): model.eval() total_loss, total_acc = 0, 0 with torch.no_grad(): for batch in tqdm(dataloader, desc="评估批次"): input_ids = batch['input_ids'].to(device) attention_mask = batch['attention_mask'].to(device) labels = batch['label'].to(device) outputs = model(input_ids, attention_mask) loss = criterion(outputs, labels) acc = (outputs.argmax(dim=1) == labels).sum().item() total_loss += loss.item() total_acc += acc return total_loss / len(dataloader.dataset), total_acc / len(dataloader.dataset) def plot_training_history(train_losses, val_losses, train_accs, val_accs): plt.figure(figsize=(15, 5)) # 损失曲线 plt.subplot(1, 2, 1) plt.plot(train_losses, label='训练损失') plt.plot(val_losses, label='验证损失') plt.title('训练和验证损失') plt.xlabel('Epoch') plt.ylabel('Loss') plt.legend() # 准确率曲线 plt.subplot(1, 2, 2) plt.plot(train_accs, label='训练准确率') plt.plot(val_accs, label='验证准确率') plt.title('训练和验证准确率') plt.xlabel('Epoch') plt.ylabel('Accuracy') plt.legend() plt.tight_layout() plt.savefig('training_history.png') plt.show() def save_model(model, tokenizer, label_map, path="website_classifier.pth"): torch.save({ 'model_state_dict': model.state_dict(), 'label_map': label_map, 'tokenizer_config': tokenizer.init_kwargs }, path) print(f"模型已保存至: {path}") # 5. 
主训练函数 def train_website_classifier(): # 加载和准备数据 train_path = "train_dataset.csv" test_path = "test_dataset.csv" train_df, val_df, test_df, label_map, num_classes = load_and_prepare_data(train_path, test_path) # 初始化分词器和模型 model_dir = "./model/bert-base-multilingual-cased" tokenizer = BertTokenizer.from_pretrained(model_dir) bert_model = BertModel.from_pretrained(model_dir) model = WebsiteClassifier(bert_model, num_classes=num_classes) model.to(DEVICE) # 创建数据加载器 batch_size = 16 train_loader, val_loader, test_loader = create_dataloaders( train_df, val_df, test_df, tokenizer, label_map, batch_size ) # 设置优化器和损失函数 optimizer = Adam(model.parameters(), lr=3e-5) # BERT推荐的微调学习率 criterion = nn.CrossEntropyLoss() # 训练参数 epochs = 5 best_acc = 0 history = { 'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': [] } print(f"\n开始训练...\n{'='*40}") for epoch in range(epochs): print(f"\nEpoch {epoch+1}/{epochs}") print('-'*40) # 训练 train_loss, train_acc = train_epoch( model, train_loader, criterion, optimizer, DEVICE ) history['train_loss'].append(train_loss) history['train_acc'].append(train_acc) # 验证 val_loss, val_acc = evaluate( model, val_loader, criterion, DEVICE ) history['val_loss'].append(val_loss) history['val_acc'].append(val_acc) print(f"\n训练结果: 损失={train_loss:.4f}, 准确率={train_acc:.4f}") print(f"验证结果: 损失={val_loss:.4f}, 准确率={val_acc:.4f}") # 保存最佳模型 if val_acc > best_acc: best_acc = val_acc save_model(model, tokenizer, label_map, f"best_model_epoch{epoch+1}.pth") # 可视化训练过程 plot_training_history( history['train_loss'], history['val_loss'], history['train_acc'], history['val_acc'] ) # 在测试集上评估最终模型 print("\n测试最终模型...") test_loss, test_acc = evaluate(model, test_loader, criterion, DEVICE) print(f"测试集性能: 损失={test_loss:.4f}, 准确率={test_acc:.4f}") # 保存最终模型 save_model(model, tokenizer, label_map) return model, tokenizer, label_map # 6. 
运行训练 if __name__ == "__main__": trained_model, tokenizer, label_map = train_website_classifier() print("训练完成!") 这是原代码 import torch import pandas as pd import numpy as np from sklearn.metrics import classification_report, confusion_matrix import seaborn as sns import matplotlib.pyplot as plt from tqdm import tqdm from torch import nn from torch.utils.data import Dataset, DataLoader from transformers import BertTokenizer, BertModel # 关键修复:确保WebsiteDataset返回'website_text' class WebsiteDataset(Dataset): def __init__(self, df, tokenizer, label_map, max_length=512): self.texts = df['website_text'].tolist() self.labels = [label_map[label] for label in df['category']] self.tokenizer = tokenizer self.max_length = max_length def __len__(self): return len(self.labels) def __getitem__(self, idx): text = str(self.texts[idx]) label = self.labels[idx] encoding = self.tokenizer.encode_plus( text, add_special_tokens=True, max_length=self.max_length, padding='max_length', truncation=True, return_attention_mask=True, return_tensors='pt' ) # 确保包含website_text字段 return { 'input_ids': encoding['input_ids'].flatten(), 'attention_mask': encoding['attention_mask'].flatten(), 'label': torch.tensor(label, dtype=torch.long), 'website_text': text # 修复:添加website_text字段 } class WebsiteClassifier(nn.Module): def __init__(self, bert_model, num_classes, dropout=0.3): super().__init__() self.bert = bert_model self.dropout = nn.Dropout(dropout) self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes) def forward(self, input_ids, attention_mask): outputs = self.bert( input_ids=input_ids, attention_mask=attention_mask ) pooled_output = outputs.pooler_output pooled_output = self.dropout(pooled_output) return self.classifier(pooled_output) # 设备配置 DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu") def load_model(model_path, bert_model_dir="./model/bert-base-multilingual-cased"): """加载保存的模型""" checkpoint = torch.load(model_path, map_location=DEVICE) tokenizer = BertTokenizer.from_pretrained(bert_model_dir) bert_model = BertModel.from_pretrained(bert_model_dir) num_classes = len(checkpoint['label_map']) model = WebsiteClassifier(bert_model, num_classes) model.load_state_dict(checkpoint['model_state_dict']) model.to(DEVICE) model.eval() return model, tokenizer, checkpoint['label_map'] def evaluate_model(model, test_loader): """在测试集上评估模型性能""" true_labels = [] pred_labels = [] all_probs = [] texts = [] with torch.no_grad(): for batch in tqdm(test_loader, desc="测试批次"): input_ids = batch['input_ids'].to(DEVICE) attention_mask = batch['attention_mask'].to(DEVICE) labels = batch['label'].cpu().numpy() outputs = model(input_ids, attention_mask) probs = torch.nn.functional.softmax(outputs, dim=1).cpu().numpy() preds = outputs.argmax(dim=1).cpu().numpy() true_labels.extend(labels) pred_labels.extend(preds) all_probs.extend(probs) texts.extend(batch['website_text']) # 直接从batch获取 return true_labels, pred_labels, all_probs, texts def generate_classification_report(true_labels, pred_labels, label_map): """生成分类报告和混淆矩阵""" inv_label_map = {v: k for k, v in label_map.items()} class_names = [inv_label_map[i] for i in range(len(label_map))] report = classification_report( true_labels, pred_labels, target_names=class_names, output_dict=True, zero_division=0 ) report_df = pd.DataFrame(report).transpose() cm = confusion_matrix(true_labels, pred_labels) cm_df = pd.DataFrame(cm, index=class_names, columns=class_names) return report_df, cm_df def plot_confusion_matrix(cm_df, title='混淆矩阵'): """绘制混淆矩阵热力图""" 
plt.figure(figsize=(12, 10)) sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues', cbar=False) plt.title(title) plt.ylabel('真实标签') plt.xlabel('预测标签') plt.xticks(rotation=45) plt.yticks(rotation=0) plt.tight_layout() plt.savefig('confusion_matrix.png') plt.close() # 避免阻塞运行 def save_detailed_results(df, filename="detailed_results.csv"): """保存详细预测结果""" df.to_csv(filename, index=False, encoding='utf-8-sig') print(f"详细预测结果已保存至: {filename}") def main(): # 配置参数 MODEL_PATH = "website_classifier.pth" TEST_CSV = "test_dataset.csv" BATCH_SIZE = 16 print(f"使用设备: {DEVICE}") print("加载训练好的模型...") model, tokenizer, label_map = load_model(MODEL_PATH) print(f"模型加载完成,类别数: {len(label_map)}") print("\n准备测试数据...") test_df = pd.read_csv(TEST_CSV) test_dataset = WebsiteDataset(test_df, tokenizer, label_map) test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False) print("\n运行模型预测...") true_labels, pred_labels, all_probs, texts = evaluate_model(model, test_loader) print("\n生成分类报告...") report_df, cm_df = generate_classification_report(true_labels, pred_labels, label_map) print("\n================ 分类报告 ================") print(report_df) print("\n================ 混淆矩阵 ================") print(cm_df) plot_confusion_matrix(cm_df) print("\n保存详细预测结果...") inv_label_map = {v: k for k, v in label_map.items()} results_df = pd.DataFrame({ 'text': texts, 'true_label': [inv_label_map[l] for l in true_labels], 'predicted_label': [inv_label_map[l] for l in pred_labels], 'confidence': [max(probs) for probs in all_probs] }) # 添加每个类别的概率 for i in range(len(label_map)): class_name = inv_label_map[i] results_df[f'prob_{class_name}'] = [prob[i] for prob in all_probs] save_detailed_results(results_df) # 打印关键指标 accuracy = report_df.loc['accuracy', 'f1-score'] macro_f1 = report_df.loc['macro avg', 'f1-score'] weighted_f1 = report_df.loc['weighted avg', 'f1-score'] print("\n================ 关键指标 ================") print(f"测试集样本数: {len(test_df)}") print(f"整体准确率: {accuracy:.4f}") print(f"宏平均F1分数: {macro_f1:.4f}") print(f"加权平均F1分数: {weighted_f1:.4f}") if __name__ == "__main__": main() 这是验证测试集的代码,为什么第二个代码再使用模型预测时速度慢于训练时的代码

import torch import torch.nn as nn import pandas as pd import numpy as np from torch.utils.data import DataLoader, Dataset from transformers import BertTokenizer, BertModel, AdamW from tqdm import tqdm import os # 标签映射 label_map = {'运费未到账': 0, '运费金额不对': 1, '运输凭证上传': 2, '钱包提现申诉': 3} # 模型路径 - 指向本地下载的模型 model_path = r'D:\tyj\Mywork_bert\bert-base-chinese' # 加载分词器 tokenizer = BertTokenizer.from_pretrained(model_path) # 定义数据集类 class TextDataset(Dataset): def __init__(self, df): self.labels = [label_map[label] for label in df['labels']] self.texts = df['texts'].tolist() def __len__(self): return len(self.labels) def __getitem__(self, idx): text = self.texts[idx] label = self.labels[idx] encoding = tokenizer.encode_plus( text, add_special_tokens=True, max_length=512, padding='max_length', truncation=True, return_attention_mask=True, return_tensors='pt', ) return { 'input_ids': encoding['input_ids'].flatten(), 'attention_mask': encoding['attention_mask'].flatten(), 'label': torch.tensor(label, dtype=torch.long) } # 构建分类模型 class BertClassifier(nn.Module): def __init__(self, num_classes=4, dropout=0.5): super(BertClassifier, self).__init__() self.bert = BertModel.from_pretrained(model_path) self.dropout = nn.Dropout(dropout) self.classifier = nn.Linear(768, num_classes) def forward(self, input_ids, attention_mask): outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask) pooled_output = outputs.pooler_output pooled_output = self.dropout(pooled_output) logits = self.classifier(pooled_output) return logits 出现以下错误: OSError: Unable to load weights from pytorch checkpoint file for 'D:/tyj/Mywork_bert/bert-base-chinese\pytorch_model.bin' at 'D:/tyj/Mywork_bert/bert-base-chinese\pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. 我搜了一些解决方案: model = AutoModelForSeq2SeqLM.from_pretrained( args.model_name_or_path, # from_tf=bool(".ckpt" in args.model_name_or_path), config=config, from_tf=True, )这个是写在什么地方的?

使用fp32精度可能会使我的电脑内存被占满。这是我的代码,请你给我解决方案from transformers import Trainer, TrainingArguments, DataCollatorForSeq2Seq from datasets import Dataset import json import pandas as pd from transformers import AutoModelForCausalLM, AutoTokenizer import torch from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score from tqdm import tqdm from transformers import DataCollatorForLanguageModeling import numpy as np data = [] with open("E:/train.jsonl", "r", encoding="utf-8") as f: for line_number, line in enumerate(f, 1): line = line.strip() if not line: continue # 跳过空行 try: record = json.loads(line) data.append(record) except json.JSONDecodeError as e: print(f"第 {line_number} 行解析失败:{e}") # 可以在此处记录错误或跳过这行 train_df = pd.DataFrame(data) print(train_df.head()) val_data = [] with open("E:/valid.jsonl", "r", encoding="utf-8") as f: for line_number, line in enumerate(f, 1): line = line.strip() if not line: continue # 跳过空行 try: record = json.loads(line) val_data.append(record) except json.JSONDecodeError as e: print(f"第 {line_number} 行解析失败:{e}") # 可以在此处记录错误或跳过这行 valid_df = pd.DataFrame(val_data) # 指定本地模型路径 local_model_path = "E:/Qwen/Qwen2.5-1.5B-Instruct" # 加载模型和分词器(确保 local_files_only=True) model = AutoModelForCausalLM.from_pretrained( local_model_path, # torch_dtype="auto", torch_dtype=torch.float16, # 强制使用 FP16 device_map="auto", local_files_only=True ) tokenizer = AutoTokenizer.from_pretrained( local_model_path, local_files_only=True ) # 设置 pad_token 以避免生成时出现问题 tokenizer.pad_token = tokenizer.eos_token tokenizer.pad_token_id = tokenizer.eos_token_id model.config.pad_token_id = tokenizer.pad_token_id def construct_prompt(code): prompt_template = ( "请扮演一位软件安全专家。分析以下函数代码,判断代码中是否存在安全漏洞。当函数代码存在漏洞时,你只需要输出一个单词:yes,当代码不存在漏洞时,

import torch import re import numpy as np from typing import List, Tuple, Dict, Any from transformers import ( AutoTokenizer, PreTrainedModel, AutoConfig, LlamaForCausalLM, GenerationConfig ) import torch.nn as nn from tqdm import tqdm from collections import defaultdict import pandas as pd # -------------------------- # 1. 常量与预处理函数(采用新的数据处理方式) # -------------------------- VALID_ELEMENTS = ["C", "N", "P", "O", "S", "Si", "I", "H", "Cl", "F", "Br", "B", "Se", "Fe", "Co", "As", "K", "Na"] element_to_idx = {elem: idx for idx, elem in enumerate(VALID_ELEMENTS)} CHEM_FORMULA_SIZE = r"([A-Z][a-z]*)([0-9]*)" # 新增的分子公式解析函数 def parse_chem_formula(formula): pattern = r'([A-Z][a-z]?)(\d*)' matches = re.findall(pattern, formula) element_counts = defaultdict(int) for (element, count) in matches: count = int(count) if count else 1 element_counts[element] += count return element_counts def generate_element_list(formula): element_counts = parse_chem_formula(formula) elements = [] for element, count in element_counts.items(): # 跳过氢元素 if element != "H": elements.extend([element] * count) return ''.join(elements) # 化学式转密集向量 def formula_to_dense(chem_formula: str) -> torch.Tensor: dense_vec = torch.zeros(len(VALID_ELEMENTS), dtype=torch.float32) matches = re.findall(CHEM_FORMULA_SIZE, chem_formula) for chem_symbol, num_str in matches: num = 1 if num_str == "" else int(num_str) if chem_symbol in element_to_idx: idx = element_to_idx[chem_symbol] dense_vec[idx] += num return dense_vec # 位置编码生成 (PyTorch实现) def positional_encoding(max_position: int, d_model: int, min_freq: float = 1e-4) -> torch.Tensor: position = torch.arange(max_position).unsqueeze(1) div_term = torch.exp(torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(min_freq)) / d_model)) pos_enc = torch.zeros(max_position, d_model) pos_enc[:, 0::2] = torch.sin(position * div_term) pos_enc[:, 1::2] = torch.cos(position * div_term) return pos_enc # 初始化位置编码矩阵 P = positional_encoding(2000000, 254) dimn = 254 # 与位置编码维度一致 # 质谱数据编码 - 优化短数据处理:仅截断过长数据,不填充短数据 def encode_spectra(rag_tensor: list, P: torch.Tensor, dimn: int) -> list: # 返回列表而非堆叠张量 encoded_list = [] max_len = 501 # 仅对过长数据截断,不强制填充短数据 for sample in rag_tensor: mz_list, intensity_list = sample # 创建基础特征矩阵 [m/z, intensity] base_features = torch.tensor([mz_list, intensity_list], dtype=torch.float32).T # 添加位置编码特征(保留原始m/z的位置信息) pos_enc = torch.stack([P[min(int(mz), P.size(0)-1)] for mz in mz_list]) # 组合所有特征 [m/z, intensity, pos_enc...] features = torch.cat([base_features, pos_enc], dim=1) # 仅截断过长数据,短数据保持原始长度(不填充) if features.size(0) > max_len: features = features[:max_len] encoded_list.append(features) # 保留原始长度特征 return encoded_list # 质谱数据预处理 - 确保短数据完整保留 def preprocess_spectra_for_inference(spectrum_str: str, total_mass: float) -> list: # 解析质谱字符串 pairs = spectrum_str.split() mz_list, intensity_list = [], [] for pair in pairs: mz, intensity = pair.split(':') mz_list.append(float(mz)) intensity_list.append(float(intensity)) # 对于仅含一组数据的情况,额外保留原始精度(不四舍五入) if len(pairs) == 1: # 保留原始精度,不进行四舍五入 mz_list = [float(mz) for mz, _ in [pair.split(':') for pair in pairs]] intensity_list = [float(intensity) for _, intensity in [pair.split(':') for pair in pairs]] # 添加总精确质量(作为补充特征,不影响原始数据长度) mz_list.append(total_mass) intensity_list.append(0.0) # 仅对长数据进行四舍五入,短数据保留更多精度 if len(mz_list) > 5: # 数据较长时才简化 mz_list = [round(mz, 2) for mz in mz_list] intensity_list = [round(intensity, 2) for intensity in intensity_list] return [[mz_list, intensity_list]] # -------------------------- # 2. 
模型类定义(保持结构,采用新实现) # -------------------------- from transformers.modeling_outputs import CausalLMOutputWithPast from transformers.generation.utils import GenerationMixin class LlamaWithEncoder(PreTrainedModel, GenerationMixin): config_class = AutoConfig _no_split_modules = ["LlamaDecoderLayer", "TransformerEncoderLayer"] def __init__(self, config, base_model=None, encoder1_dim=18, encoder2_dim=256, hidden_dim=512): # 添加config属性 self.config = config super().__init__(self.config) # 如果未提供base_model,则从config初始化 if base_model is None: self.model = LlamaForCausalLM(config) else: self.model = base_model # 第一个Transformer Encoder(处理分子式向量) encoder1_layer = nn.TransformerEncoderLayer( d_model=encoder1_dim, nhead=3, dim_feedforward=hidden_dim, batch_first=True ) self.encoder1 = nn.TransformerEncoder(encoder1_layer, num_layers=2) # 第二个Transformer Encoder(处理质谱矩阵) encoder2_layer = nn.TransformerEncoderLayer( d_model=encoder2_dim, nhead=4, dim_feedforward=hidden_dim, batch_first=True ) self.encoder2 = nn.TransformerEncoder(encoder2_layer, num_layers=2) # 投影层:将编码器输出映射到模型隐藏层维度 self.proj1 = nn.Linear(encoder1_dim, base_model.config.hidden_size) self.proj2 = nn.Linear(encoder2_dim, base_model.config.hidden_size) # 嵌入层(复制基础模型权重但不共享) self.embed_tokens = nn.Embedding( num_embeddings=base_model.config.vocab_size, embedding_dim=base_model.config.hidden_size, padding_idx=base_model.config.pad_token_id ) self.embed_tokens.weight.data = base_model.get_input_embeddings().weight.data.clone() # 必要接口实现 def get_input_embeddings(self): return self.embed_tokens def set_input_embeddings(self, value): self.embed_tokens = value def get_output_embeddings(self): return self.model.get_output_embeddings() def set_output_embeddings(self, new_embeddings): self.model.set_output_embeddings(new_embeddings) def get_base_model(self): return self.model def forward( self, input_ids=None, attention_mask=None, encoder1_inputs=None, encoder2_inputs=None, labels=None, past_key_values=None, output_attentions=None, output_hidden_states=None, return_dict=None,** kwargs ) -> CausalLMOutputWithPast: # 1. 编码器处理 # 分子式编码器输出 enc1_out = self.encoder1(encoder1_inputs) # (batch_size, 1, 18) enc1_out = enc1_out.mean(dim=1) # (batch_size, 18) enc1_proj = self.proj1(enc1_out) # (batch_size, hidden_size) # 质谱编码器输出 enc2_out = self.encoder2(encoder2_inputs) # (batch_size, seq_len, 256) enc2_out = enc2_out.mean(dim=1) # (batch_size, 256) enc2_proj = self.proj2(enc2_out) # (batch_size, hidden_size) # 合并编码器输出(用于替换<mask>) mask_replacement = (enc1_proj + enc2_proj) / 2 # (batch_size, hidden_size) # 2. 获取原始嵌入(避免inplace,全程用新张量) embeddings = self.embed_tokens(input_ids) # (batch_size, seq_len, hidden_size) batch_size, seq_len, hidden_size = embeddings.size() # 3. 替换<mask> token(第三个token,索引=2):用拼接替代inplace赋值 if seq_len > 2: mask_embed = mask_replacement.unsqueeze(1) # (batch_size, 1, hidden_size) # 拆分张量并拼接(前2个token + 替换的mask_embed + 剩余token) part1 = embeddings[:, :2, :] # (batch_size, 2, hidden_size) part2 = mask_embed # (batch_size, 1, hidden_size) part3 = embeddings[:, 3:, :] # (batch_size, seq_len-3, hidden_size) # 拼接为新张量(无inplace操作) new_embeddings = torch.cat([part1, part2, part3], dim=1) # (batch_size, seq_len, hidden_size) else: new_embeddings = embeddings # 序列过短时直接使用原始嵌入 # 4. 
调用基础模型 return self.model( inputs_embeds=new_embeddings, attention_mask=attention_mask, labels=labels, past_key_values=past_key_values, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, ) def prepare_inputs_for_generation(self, input_ids, **kwargs): return { "input_ids": input_ids, "attention_mask": kwargs.get("attention_mask", None), "encoder1_inputs": kwargs.get("encoder1_inputs", None), "encoder2_inputs": kwargs.get("encoder2_inputs", None), } def _get_generation_device(self): return next(self.parameters()).device # -------------------------- # 3. 加载模型和Tokenizer(修复核心错误) # -------------------------- model_path = "./llama3.2-SELFIES" # 模型保存路径 # 加载分词器 tokenizer = AutoTokenizer.from_pretrained(model_path) # 确保mask token存在 if tokenizer.mask_token is None: tokenizer.add_special_tokens({"mask_token": "<mask>"}) # 加载模型配置 config = AutoConfig.from_pretrained(model_path) # 设备配置(优先使用GPU) device = "cuda:0" if torch.cuda.is_available() else "cpu" # 修复:先加载基础模型,再传入自定义模型 base_model = LlamaForCausalLM.from_pretrained( model_path, config=config, torch_dtype=torch.bfloat16, # 确保基础模型为bfloat16精度 device_map=device ) # 使用基础模型初始化自定义模型 model = LlamaWithEncoder( config=config, base_model=base_model, encoder1_dim=18, encoder2_dim=256, hidden_dim=512 ) model = model.to(device) # 先转移到设备 model.eval() # 推理模式 # -------------------------- # 4. 推理函数(适配新的数据处理方式) # -------------------------- def generate_selfies( formula: str, spectrum_str: str, total_mass: float, max_length: int = 512, temperature: float = 0.7, top_p: float = 0.9 ) -> str: """生成SELFIES字符串""" model_device = next(model.parameters()).device # 1. 生成element_list element_list = generate_element_list(formula) # 2. 处理分子式向量 formula_vec = formula_to_dense(formula).unsqueeze(0).unsqueeze(0) # (1,1,18) formula_vec = formula_vec.to(model_device, dtype=torch.bfloat16) # 3. 处理质谱数据(使用新的预处理和编码方式) spectra_data = preprocess_spectra_for_inference(spectrum_str, total_mass) spec_encoded = encode_spectra(spectra_data, P, dimn) # 得到列表形式的编码结果 spec_matrix = spec_encoded[0].to(model_device, dtype=torch.bfloat16).unsqueeze(0) # 添加批次维度 # 4. 构造输入提示 prompt = f"<|User|><|Spectrum|>{element_list}" input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model_device) attention_mask = torch.ones_like(input_ids).to(model_device) # 5. 模型生成 with torch.no_grad(): # 关闭梯度计算 outputs = model.generate( input_ids=input_ids, attention_mask=attention_mask, encoder1_inputs=formula_vec, # 分子式特征 encoder2_inputs=spec_matrix, # 质谱特征 max_length=max_length, temperature=temperature, top_p=top_p, ) # 6. 解码生成结果(去除特殊token) generated = tokenizer.decode(outputs[0], skip_special_tokens=False) return generated # -------------------------- # 5. 
推理示例 # -------------------------- if __name__ == "__main__": # 示例输入 example_formula = "C9H9N3O2S2" # 分子式 example_spectrum_str = "256.0153:100.000000" # mz:intensity格式 example_total_mass = 255.0136185 # 总精确质量 # 生成SELFIES result = generate_selfies( formula=example_formula, spectrum_str=example_spectrum_str, total_mass=example_total_mass, max_length=512, temperature=0.7, top_p=0.95 ) print("生成的SELFIES字符串:") print(result)修改代码,解决问题Some weights of LlamaForCausalLM were not initialized from the model checkpoint at ./llama3.2-SELFIES and are newly initialized: ['lm_head.weight', 'model.embed_tokens.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.post_attention_layernorm.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.1.input_layernorm.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.1.post_attention_layernorm.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.1.self_attn.o_proj.weight', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.10.input_layernorm.weight', 'model.layers.10.mlp.down_proj.weight', 'model.layers.10.mlp.gate_proj.weight', 'model.layers.10.mlp.up_proj.weight', 'model.layers.10.post_attention_layernorm.weight', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.10.self_attn.o_proj.weight', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.11.input_layernorm.weight', 'model.layers.11.mlp.down_proj.weight', 'model.layers.11.mlp.gate_proj.weight', 'model.layers.11.mlp.up_proj.weight', 'model.layers.11.post_attention_layernorm.weight', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.11.self_attn.o_proj.weight', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.12.input_layernorm.weight', 'model.layers.12.mlp.down_proj.weight', 'model.layers.12.mlp.gate_proj.weight', 'model.layers.12.mlp.up_proj.weight', 'model.layers.12.post_attention_layernorm.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.12.self_attn.o_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.13.input_layernorm.weight', 'model.layers.13.mlp.down_proj.weight', 'model.layers.13.mlp.gate_proj.weight', 'model.layers.13.mlp.up_proj.weight', 'model.layers.13.post_attention_layernorm.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.13.self_attn.o_proj.weight', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.14.input_layernorm.weight', 'model.layers.14.mlp.down_proj.weight', 'model.layers.14.mlp.gate_proj.weight', 'model.layers.14.mlp.up_proj.weight', 'model.layers.14.post_attention_layernorm.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.14.self_attn.o_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.15.input_layernorm.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.15.mlp.gate_proj.weight', 'model.layers.15.mlp.up_proj.weight', 'model.layers.15.post_attention_layernorm.weight', 'model.layers.15.self_attn.k_proj.weight', 
'model.layers.15.self_attn.o_proj.weight', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.16.mlp.down_proj.weight', 'model.layers.16.mlp.gate_proj.weight', 'model.layers.16.mlp.up_proj.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.16.self_attn.o_proj.weight', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.17.input_layernorm.weight', 'model.layers.17.mlp.down_proj.weight', 'model.layers.17.mlp.gate_proj.weight', 'model.layers.17.mlp.up_proj.weight', 'model.layers.17.post_attention_layernorm.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.18.input_layernorm.weight', 'model.layers.18.mlp.down_proj.weight', 'model.layers.18.mlp.gate_proj.weight', 'model.layers.18.mlp.up_proj.weight', 'model.layers.18.post_attention_layernorm.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.18.self_attn.o_proj.weight', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.19.mlp.gate_proj.weight', 'model.layers.19.mlp.up_proj.weight', 'model.layers.19.post_attention_layernorm.weight', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.19.self_attn.o_proj.weight', 'model.layers.19.self_attn.q_proj.weight', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.2.input_layernorm.weight', 'model.layers.2.mlp.down_proj.weight', 'model.layers.2.mlp.gate_proj.weight', 'model.layers.2.mlp.up_proj.weight', 'model.layers.2.post_attention_layernorm.weight', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.2.self_attn.o_proj.weight', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.20.input_layernorm.weight', 'model.layers.20.mlp.down_proj.weight', 'model.layers.20.mlp.gate_proj.weight', 'model.layers.20.mlp.up_proj.weight', 'model.layers.20.post_attention_layernorm.weight', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.21.input_layernorm.weight', 'model.layers.21.mlp.down_proj.weight', 'model.layers.21.mlp.gate_proj.weight', 'model.layers.21.mlp.up_proj.weight', 'model.layers.21.post_attention_layernorm.weight', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.21.self_attn.o_proj.weight', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.22.input_layernorm.weight', 'model.layers.22.mlp.down_proj.weight', 'model.layers.22.mlp.gate_proj.weight', 'model.layers.22.mlp.up_proj.weight', 'model.layers.22.post_attention_layernorm.weight', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.22.self_attn.o_proj.weight', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.23.input_layernorm.weight', 'model.layers.23.mlp.down_proj.weight', 'model.layers.23.mlp.gate_proj.weight', 'model.layers.23.mlp.up_proj.weight', 'model.layers.23.post_attention_layernorm.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 
'model.layers.23.self_attn.v_proj.weight', 'model.layers.24.input_layernorm.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.24.mlp.gate_proj.weight', 'model.layers.24.mlp.up_proj.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.24.self_attn.k_proj.weight', 'model.layers.24.self_attn.o_proj.weight', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.24.self_attn.v_proj.weight', 'model.layers.25.input_layernorm.weight', 'model.layers.25.mlp.down_proj.weight', 'model.layers.25.mlp.gate_proj.weight', 'model.layers.25.mlp.up_proj.weight', 'model.layers.25.post_attention_layernorm.weight', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.25.self_attn.o_proj.weight', 'model.layers.25.self_attn.q_proj.weight', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.26.mlp.down_proj.weight', 'model.layers.26.mlp.gate_proj.weight', 'model.layers.26.mlp.up_proj.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.26.self_attn.o_proj.weight', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.26.self_attn.v_proj.weight', 'model.layers.27.input_layernorm.weight', 'model.layers.27.mlp.down_proj.weight', 'model.layers.27.mlp.gate_proj.weight', 'model.layers.27.mlp.up_proj.weight', 'model.layers.27.post_attention_layernorm.weight', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.3.input_layernorm.weight', 'model.layers.3.mlp.down_proj.weight', 'model.layers.3.mlp.gate_proj.weight', 'model.layers.3.mlp.up_proj.weight', 'model.layers.3.post_attention_layernorm.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.3.self_attn.o_proj.weight', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.4.input_layernorm.weight', 'model.layers.4.mlp.down_proj.weight', 'model.layers.4.mlp.gate_proj.weight', 'model.layers.4.mlp.up_proj.weight', 'model.layers.4.post_attention_layernorm.weight', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.4.self_attn.o_proj.weight', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.5.input_layernorm.weight', 'model.layers.5.mlp.down_proj.weight', 'model.layers.5.mlp.gate_proj.weight', 'model.layers.5.mlp.up_proj.weight', 'model.layers.5.post_attention_layernorm.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.5.self_attn.o_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.6.input_layernorm.weight', 'model.layers.6.mlp.down_proj.weight', 'model.layers.6.mlp.gate_proj.weight', 'model.layers.6.mlp.up_proj.weight', 'model.layers.6.post_attention_layernorm.weight', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.6.self_attn.o_proj.weight', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.7.input_layernorm.weight', 'model.layers.7.mlp.down_proj.weight', 'model.layers.7.mlp.gate_proj.weight', 'model.layers.7.mlp.up_proj.weight', 'model.layers.7.post_attention_layernorm.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.7.self_attn.o_proj.weight', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.7.self_attn.v_proj.weight', 'model.layers.8.input_layernorm.weight', 'model.layers.8.mlp.down_proj.weight', 
'model.layers.8.mlp.gate_proj.weight', 'model.layers.8.mlp.up_proj.weight', 'model.layers.8.post_attention_layernorm.weight', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.8.self_attn.o_proj.weight', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.9.input_layernorm.weight', 'model.layers.9.mlp.down_proj.weight', 'model.layers.9.mlp.gate_proj.weight', 'model.layers.9.mlp.up_proj.weight', 'model.layers.9.post_attention_layernorm.weight', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.9.self_attn.o_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.9.self_attn.v_proj.weight', 'model.norm.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

from transformers import AutoTokenizer,AutoModelForSequenceClassification,Trainer,TrainingArguments,pipeline from datasets import load_dataset import pandas as pd import numpy as np import seaborn as sn import matplotlib.pyplot as plt plt.switch_backend('agg') from sklearn.metrics import roc_auc_score,f1_score,confusion_matrix from sklearn.model_selection import train_test_split import torch from pprint import pprint #读取文件 df = pd.read_csv('test.csv',encoding='gbk') #定义转换字典 target_map={ 'positive':2, 'negative':1, 'neutral':0 } #将文件中的对应单词转换为数字 单独列出一列 df['target'] = df['sentiment'].map(target_map) #将文本和标签提取出来 这里导出为新的csv文件 方便后续load_data df2 = df[['text','target']] df2.columns=['sentence','label'] df2.to_csv('data.csv',index=None) raw_datasets=load_dataset('csv',data_files='data.csv') #加载训练的数据格式 split =raw_datasets['train'].train_test_split(test_size=0.3, seed=42) #分隔为数据集和测试集 测试集占百分之30 # tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese') bert_model_path = "bert-base-chinese" tokenizer = AutoTokenizer.from_pretrained(bert_model_path) model = AutoModelForSequenceClassification.from_pretrained(bert_model_path, num_labels=3) def tokenize_fn(batch): return tokenizer(batch['sentence'],truncation=True) #将数据集中的句子使用标记器进行标记化处理,以便后续在自然语言处理任务中使用 tokenized_datasets = split.map(tokenize_fn,batched=True) model = AutoModelForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=3) params_before = [] # 遍历模型中的所有参数,并将其初始值添加到params_before列表中,以便后续比较模型参数的变化。 for name, p in model.named_parameters(): params_before.append(p.detach().cpu().numpy()) training_args = TrainingArguments( output_dir='training_dir', #输出文件夹 evaluation_strategy='epoch', save_strategy='epoch', num_train_epochs=3, #训练轮次 # 将每批训练量减半 # per_device_train_batch_size=16, # per_device_eval_batch_size=64, per_device_train_batch_size=4, per_device_eval_batch_size=32 ) def compute_metrics(logits_and_labels): logits, labels = logits_and_labels predictions = np.argmax(logits, axis=-1) acc = np.mean(predictions == labels) f1 = f1_score(labels, predictions, average='macro') # F1=2*准确率*召回率/(准确率+召回率) return {'===accuracy===': acc, '===f1===': f1} trainer = Trainer( model, # 模型实例,用于训练 training_args, # 训练参数,包括学习率、批大小、训练轮数等 train_dataset=tokenized_datasets["train"], # 训练数据集 eval_dataset=tokenized_datasets["test"], # 验证数据集 tokenizer=tokenizer, # 分词器,用于对输入进行分词 compute_metrics=compute_metrics # 用于计算性能指标的函数 ) trainer.train() 解释上述代码

```python
import torch
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader  # key fix: this import was missing
from transformers import BertTokenizer, BertModel

# Re-use the components defined in the training script
from bret_train import WebsiteDataset, WebsiteClassifier, DEVICE


def load_model(model_path, bert_model_dir="./model/bert-base-multilingual-cased"):
    """Load a saved classifier checkpoint."""
    checkpoint = torch.load(model_path, map_location=DEVICE)

    # Initialise the BERT backbone and tokenizer
    tokenizer = BertTokenizer.from_pretrained(bert_model_dir)
    bert_model = BertModel.from_pretrained(bert_model_dir)

    # Rebuild the classifier head and restore its weights
    num_classes = len(checkpoint['label_map'])
    model = WebsiteClassifier(bert_model, num_classes)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(DEVICE)
    model.eval()

    return model, tokenizer, checkpoint['label_map']


def evaluate_model(model, test_loader):
    """Run the model over the test set and collect predictions."""
    true_labels = []
    pred_labels = []
    all_probs = []
    texts = []

    with torch.no_grad():
        for batch in tqdm(test_loader, desc="测试批次"):
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            labels = batch['label'].cpu().numpy()

            outputs = model(input_ids, attention_mask)
            probs = torch.nn.functional.softmax(outputs, dim=1).cpu().numpy()
            preds = outputs.argmax(dim=1).cpu().numpy()

            true_labels.extend(labels)
            pred_labels.extend(preds)
            all_probs.extend(probs)
            # Fetch the raw text directly from the batch
            texts.extend([str(text) for text in batch['website_text']])

    return true_labels, pred_labels, all_probs, texts


def generate_classification_report(true_labels, pred_labels, label_map):
    """Build a classification report and confusion matrix."""
    inv_label_map = {v: k for k, v in label_map.items()}
    class_names = [inv_label_map[i] for i in range(len(label_map))]

    report = classification_report(
        true_labels, pred_labels,
        target_names=class_names,
        output_dict=True,
        zero_division=0
    )
    report_df = pd.DataFrame(report).transpose()

    cm = confusion_matrix(true_labels, pred_labels)
    cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)

    return report_df, cm_df


def plot_confusion_matrix(cm_df, title='Confusion Matrix'):
    """Plot the confusion matrix as a heatmap."""
    plt.figure(figsize=(12, 10))
    sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(title)
    plt.ylabel('真实标签')
    plt.xlabel('预测标签')
    plt.xticks(rotation=45)
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.savefig('confusion_matrix.png')
    plt.show()


def save_detailed_results(df, filename="detailed_results.csv"):
    """Save per-sample predictions to CSV."""
    df.to_csv(filename, index=False, encoding='utf-8-sig')
    print(f"详细预测结果已保存至: {filename}")


def main():
    # Configuration
    MODEL_PATH = "website_classifier.pth"  # trained model checkpoint
    TEST_CSV = "test_dataset.csv"          # test set
    BATCH_SIZE = 16

    # 1. Load the model
    print("加载训练好的模型...")
    model, tokenizer, label_map = load_model(MODEL_PATH)
    print(f"模型加载完成,类别数: {len(label_map)}")

    # 2. Prepare the test data
    print("\n准备测试数据...")
    test_df = pd.read_csv(TEST_CSV)
    test_dataset = WebsiteDataset(test_df, tokenizer, label_map)
    test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

    # 3. Run predictions
    print("\n运行模型预测...")
    true_labels, pred_labels, all_probs, texts = evaluate_model(model, test_loader)

    # 4. Build reports
    print("\n生成分类报告...")
    report_df, cm_df = generate_classification_report(true_labels, pred_labels, label_map)

    print("\n================ 分类报告 ================")
    print(report_df)
    print("\n================ 混淆矩阵 ================")
    print(cm_df)
    plot_confusion_matrix(cm_df)

    # 5. Save detailed results
    print("\n保存详细预测结果...")
    inv_label_map = {v: k for k, v in label_map.items()}
    results_df = pd.DataFrame({
        'text': texts,
        'true_label': [inv_label_map[l] for l in true_labels],
        'predicted_label': [inv_label_map[l] for l in pred_labels],
        'confidence': [max(probs) for probs in all_probs]
    })
    # Add per-class probabilities
    for i in range(len(label_map)):
        class_name = inv_label_map[i]
        results_df[f'prob_{class_name}'] = [prob[i] for prob in all_probs]
    save_detailed_results(results_df)

    # 6. Print key metrics
    accuracy = report_df.loc['accuracy', 'f1-score']
    macro_f1 = report_df.loc['macro avg', 'f1-score']
    weighted_f1 = report_df.loc['weighted avg', 'f1-score']
    print("\n================ 关键指标 ================")
    print(f"测试集样本数: {len(test_df)}")
    print(f"整体准确率: {accuracy:.4f}")
    print(f"宏平均F1分数: {macro_f1:.4f}")
    print(f"加权平均F1分数: {weighted_f1:.4f}")


if __name__ == "__main__":
    main()
```

Running this script fails with the following error:

```
(website_classifier) wxy@wxy:~/Workspace/bert_web_classifier$ python classifier_output.py
使用设备: cpu
加载训练好的模型...
模型加载完成,类别数: 34

准备测试数据...

运行模型预测...
测试批次:   0%|          | 0/161 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/wxy/Workspace/bert_web_classifier/classifier_output.py", line 178, in <module>
    main()
  File "/home/wxy/Workspace/bert_web_classifier/classifier_output.py", line 132, in main
    true_labels, pred_labels, all_probs, texts = evaluate_model(model, test_loader)
  File "/home/wxy/Workspace/bert_web_classifier/classifier_output.py", line 61, in evaluate_model
    texts.extend([str(text) for text in batch['website_text']])
KeyError: 'website_text'
```
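The KeyError means that `WebsiteDataset.__getitem__` returns only `input_ids`, `attention_mask` and `label`, so the batch has no `website_text` field. Since that class lives in `bret_train.py` and is not shown here, the fix below is a hedged sketch on the evaluation side: stop reading the text from the batch and recover it from `test_df` instead, which is safe because the DataLoader was built with `shuffle=False`, so prediction order matches row order (the `'text'` column name is an assumption, use whatever column the dataset actually reads). An alternative is to add the raw string to the item in `__getitem__`, since PyTorch's default collate keeps string fields as per-batch lists.

```python
# Hedged fix: drop the text lookup from the batch and read it from the DataFrame instead.
def evaluate_model(model, test_loader):
    """Run the model over the test set; texts are recovered outside this function."""
    true_labels, pred_labels, all_probs = [], [], []
    with torch.no_grad():
        for batch in tqdm(test_loader, desc="测试批次"):
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            labels = batch['label'].cpu().numpy()

            outputs = model(input_ids, attention_mask)
            probs = torch.nn.functional.softmax(outputs, dim=1).cpu().numpy()
            preds = outputs.argmax(dim=1).cpu().numpy()

            true_labels.extend(labels)
            pred_labels.extend(preds)
            all_probs.extend(probs)
    return true_labels, pred_labels, all_probs

# In main(), after evaluation:
#     true_labels, pred_labels, all_probs = evaluate_model(model, test_loader)
#     texts = test_df['text'].astype(str).tolist()  # assumed column name
```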

```python
import re
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import transformers
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from lora_plus import LoraPlusTrainer  # requires the lora_plus package
from swanlab.integration.transformers import SwanLabCallback
import swanlab

# Initialise SwanLab experiment tracking
swanlab.init("Finetune-Llama3.2-with-Encoder")
swanlab_callback = SwanLabCallback(
    project="Finetune-Llama3.2-with-Encoder",
    experiment_name="Finetune-Llama3.2-with-Encoder"
)

# Constants
CHEM_FORMULA_SIZE = r"([A-Z][a-z]*)([0-9]*)"
VALID_ELEMENTS = ["C", "N", "P", "O", "S", "Si", "I", "H", "Cl", "F",
                  "Br", "B", "Se", "Fe", "Co", "As", "K", "Na"]
element_to_idx = {elem: idx for idx, elem in enumerate(VALID_ELEMENTS)}


def formula_to_dense(chem_formula: str) -> torch.Tensor:
    """Convert a chemical formula string into a dense element-count vector."""
    dense_vec = torch.zeros(len(VALID_ELEMENTS), dtype=torch.float32)
    for chem_symbol, num_str in re.findall(CHEM_FORMULA_SIZE, chem_formula):
        num = 1 if num_str == "" else int(num_str)
        if chem_symbol in element_to_idx:
            dense_vec[element_to_idx[chem_symbol]] += num
    return dense_vec


def positional_encoding(max_position: int, d_model: int, min_freq: float = 1e-4) -> torch.Tensor:
    """Sinusoidal positional encoding (PyTorch implementation)."""
    position = torch.arange(max_position).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(min_freq)) / d_model))
    pos_enc = torch.zeros(max_position, d_model)
    pos_enc[:, 0::2] = torch.sin(position * div_term)
    pos_enc[:, 1::2] = torch.cos(position * div_term)
    return pos_enc


# Pre-computed positional-encoding table used to embed m/z values
P = positional_encoding(2000000, 256)
dimn = 256  # must match the positional-encoding dimension


def encode_spectra(rag_tensor: list, P: torch.Tensor, dimn: int) -> torch.Tensor:
    """Encode each (m/z list, intensity list) pair into a fixed-size feature matrix."""
    encoded_list = []
    for mz_list, intensity_list in rag_tensor:
        # Base features: [m/z, intensity]
        base_features = torch.tensor([mz_list, intensity_list], dtype=torch.float32).T
        # Positional-encoding features looked up by (integer) m/z
        pos_enc = torch.stack([P[min(int(mz), P.size(0) - 1)] for mz in mz_list])
        # Combined features: [m/z, intensity, pos_enc...]
        features = torch.cat([base_features, pos_enc], dim=1)

        # Pad / truncate to a fixed length of 501 rows
        if features.size(0) < 501:
            padding = torch.zeros(501 - features.size(0), features.size(1))
            features = torch.cat([features, padding], dim=0)
        else:
            features = features[:501]
        encoded_list.append(features)
    return torch.stack(encoded_list)


def preprocess_spectra(df: pd.DataFrame) -> list:
    """Parse the 'Spectrum' strings into (m/z list, intensity list) pairs."""
    spectra_list = []
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        spectrum_str = row['Spectrum']
        total_mass = row['Total Exact Mass']

        mz_list, intensity_list = [], []
        for pair in spectrum_str.split():
            mz, intensity = pair.split(':')
            mz_list.append(float(mz))
            intensity_list.append(float(intensity))

        # Append the total exact mass as an extra peak with zero intensity
        mz_list.append(total_mass)
        intensity_list.append(0.0)

        # Round to two decimals
        mz_list = [round(mz, 2) for mz in mz_list]
        intensity_list = [round(intensity, 2) for intensity in intensity_list]
        spectra_list.append([mz_list, intensity_list])
    return spectra_list


class MolecularDataset(Dataset):
    """Pairs formula vectors and spectrum matrices with tokenized SELFIES strings."""

    def __init__(self, csv_path: str, tokenizer: AutoTokenizer, max_seq_len: int = 512):
        self.df = pd.read_csv(csv_path)
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

        # Pre-encode all spectra once
        spectra_data = preprocess_spectra(self.df)
        self.spec_encoded = encode_spectra(spectra_data, P, dimn)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx) -> dict:
        # Molecular-formula vector
        formula_vec = formula_to_dense(self.df.iloc[idx]['Molecular Formula'])
        # Spectrum feature matrix
        spec_matrix = self.spec_encoded[idx]

        # Tokenized SELFIES string
        encoding = self.tokenizer(
            self.df.iloc[idx]['SELFIES'],
            add_special_tokens=True,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            max_length=self.max_seq_len,
            return_tensors='pt'
        )
        return {
            'formula_vec': formula_vec,
            'spec_matrix': spec_matrix,
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0)
        }


# Load tokenizer and build the dataset
tokenizer = AutoTokenizer.from_pretrained('/root/workspace/checkpoint-2500')
dataset = MolecularDataset('/root/workspace/SELFIES-SFT.csv', tokenizer)
data_collator = transformers.DataCollatorForSeq2Seq(tokenizer=tokenizer)


class LlamaWithEncoder(nn.Module):
    """Llama wrapper that fuses two extra Transformer encoders into the first token embedding."""

    def __init__(self, base_model, encoder1_dim=18, encoder2_dim=256, hidden_dim=512):
        super().__init__()
        self.base_model = base_model
        self.config = base_model.config  # required so PEFT / Trainer can read the config

        # Encoder for the 18-dim formula vector
        encoder1_layer = nn.TransformerEncoderLayer(
            d_model=encoder1_dim, nhead=3, dim_feedforward=hidden_dim, batch_first=True
        )
        self.encoder1 = nn.TransformerEncoder(encoder1_layer, num_layers=2)

        # Encoder for the 256-dim spectrum features
        encoder2_layer = nn.TransformerEncoderLayer(
            d_model=encoder2_dim, nhead=8, dim_feedforward=hidden_dim, batch_first=True
        )
        self.encoder2 = nn.TransformerEncoder(encoder2_layer, num_layers=2)

        # Projections into the language model's hidden size, plus a fusion layer
        self.proj1 = nn.Linear(encoder1_dim, base_model.config.hidden_size)
        self.proj2 = nn.Linear(encoder2_dim, base_model.config.hidden_size)
        self.fusion = nn.Linear(2 * base_model.config.hidden_size, base_model.config.hidden_size)

    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
        return self.base_model.prepare_inputs_for_generation(
            input_ids, past_key_values=past_key_values, **kwargs
        )

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        encoder1_inputs=None,
        encoder2_inputs=None,
        labels=None,
        past_key_values=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        **kwargs
    ):
        # Encode and pool the two extra modalities
        enc1_out = self.encoder1(encoder1_inputs).mean(dim=1)
        enc1_proj = self.proj1(enc1_out)
        enc2_out = self.encoder2(encoder2_inputs).mean(dim=1)
        enc2_proj = self.proj2(enc2_out)

        # Fuse them into a single vector
        fused = self.fusion(torch.cat([enc1_proj, enc2_proj], dim=1)).unsqueeze(1)

        # Mix the fused vector into the first token's embedding
        embeddings = self.base_model.get_input_embeddings()(input_ids)
        if embeddings.size(1) > 0:
            embeddings[:, 0, :] = (embeddings[:, 0, :] + fused[:, 0, :]) / 2

        # Run the base model on the modified embeddings
        return self.base_model(
            inputs_embeds=embeddings,
            attention_mask=attention_mask,
            labels=labels,
            past_key_values=past_key_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            **kwargs
        )


# Load the pretrained base model and wrap it
base_model = transformers.AutoModelForCausalLM.from_pretrained(
    "/root/workspace/checkpoint-2500",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = LlamaWithEncoder(base_model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",  # apply LoRA to all linear layers
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # e.g. roughly 0.3% of parameters are trainable

training_args = TrainingArguments(
    output_dir="./llama3.2-SELFIES-SFT",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,
    num_train_epochs=10,
    learning_rate=5.0e-05,
    optim="adamw_torch",
    logging_steps=10,
    bf16=True,
    save_strategy="steps",
    lr_scheduler_type='cosine',
    max_grad_norm=1.0,
    save_steps=2000,
    warmup_steps=0
)


class CustomTrainer(LoraPlusTrainer):
    def get_train_dataloader(self) -> DataLoader:
        """Return a shuffled training dataloader."""
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.train_batch_size,
            shuffle=True,
            collate_fn=self.data_collator,
            drop_last=False,
        )


lp_trainer = CustomTrainer(
    model,
    training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[swanlab_callback],
)
lp_trainer.train()
lp_trainer.save_model(output_dir='./llama3.2-SELFIES-SFT')
```

Is the dataset construction in this code correct? The goal is to condition on the mass-spectrum and chemical-formula features and autoregressively generate SELFIES; at inference time no string is supplied at all, only the spectrum and formula features.
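The most likely problem is in the training plumbing rather than the feature encoding itself: `MolecularDataset` returns `formula_vec` and `spec_matrix`, but `LlamaWithEncoder.forward` expects `encoder1_inputs` and `encoder2_inputs`, and `DataCollatorForSeq2Seq` neither renames those fields nor creates a `labels` field when the dataset does not provide one. As far as I can tell, the batches therefore never reach the model in the form its forward signature expects, and no language-modelling loss is defined. A hedged sketch of a collator that bridges that gap follows; the field mapping and the labels convention (input_ids with padding positions masked to -100) are my assumptions, not something confirmed by the original code.

```python
import torch

def molecular_collate_fn(batch):
    """Collate MolecularDataset items into the kwargs LlamaWithEncoder.forward expects."""
    input_ids = torch.stack([item['input_ids'] for item in batch])
    attention_mask = torch.stack([item['attention_mask'] for item in batch])

    # Standard causal-LM SFT labels: a copy of input_ids with padding ignored by the loss
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100

    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels,
        # formula vector as a length-1 sequence for the first encoder: (B, 1, 18)
        'encoder1_inputs': torch.stack([item['formula_vec'] for item in batch]).unsqueeze(1),
        # spectrum feature matrix for the second encoder: (B, 501, feature_dim)
        'encoder2_inputs': torch.stack([item['spec_matrix'] for item in batch]),
    }
```

This collator would replace `data_collator=data_collator` (and, through it, `collate_fn=self.data_collator` in `CustomTrainer`). Separately, `encode_spectra` appears to produce rows of width 258 (2 base features plus 256 positional-encoding dimensions) while `encoder2` is built with `d_model=256`, so one of the two sides needs adjusting before the shapes match.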

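For the stated inference setup (no SELFIES text at all, only spectrum and formula features), one hedged option is to start generation from a single BOS token, so that the fused encoder output mixed into the first token's embedding is the only molecule-specific signal. Because `generate()` would not know how to route the extra encoder tensors through this wrapper, the sketch below uses a plain greedy loop; it assumes the trained `LlamaWithEncoder` (with its LoRA weights) and the SELFIES tokenizer, which must define BOS/EOS tokens.

```python
import torch

@torch.no_grad()
def generate_selfies(model, tokenizer, formula_vec, spec_matrix, max_new_tokens=256):
    """Greedy, feature-conditioned SELFIES generation starting from BOS only (sketch)."""
    model.eval()
    device = next(model.parameters()).device
    enc1 = formula_vec.unsqueeze(0).unsqueeze(0).to(device)  # (1, 1, 18)
    enc2 = spec_matrix.unsqueeze(0).to(device)               # (1, 501, feature_dim)

    # No SELFIES string is supplied: generation starts from the BOS token alone
    input_ids = torch.tensor([[tokenizer.bos_token_id]], device=device)

    for _ in range(max_new_tokens):
        out = model(
            input_ids=input_ids,
            attention_mask=torch.ones_like(input_ids),
            encoder1_inputs=enc1,
            encoder2_inputs=enc2,
        )
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
        input_ids = torch.cat([input_ids, next_id], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

This loop recomputes the full forward pass at every step (no KV cache), which is slow but keeps the fused embedding applied to position 0 exactly as it was during training.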