automodel.from_pretrained 使用本地缓存模型(huggingface.co 链接报错时可用)

huggingface.co 链接报错

在这里插入图片描述

Automodel.from_pretrained()和AAutoTokenizer.from_pretrained()同理

第一次缓存模型到指定文件夹

from transformers import AutoTokenizer, AutoModel

sen_trans_pretrained_path = os.path.join(PWD, "pretrained_w", "sentence-transformers")
model_name = 'sentence-transformers/all-MiniLM-L6-v2'

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=os.path.join(sen_trans_pretrained_path, "tokenizer"))

model = AutoModel.from_pretrained(
    model_name,
    cache_dir=os.path.join(sen_trans_pretrained_path, "model"))

cache_dir 指定要保存的本地目录,方便后面使用

报错后使用本地缓存model和tokenizer

from transformers import AutoTokenizer, AutoModel
sen_trans_pretrained_path = os.path.join(PWD, "pretrained_w", "sentence-transformers")
tail = 'models--sentence-transformers--all-MiniLM-L6-v2/snapshots/7dbbc90392e2f80f3d3c277d6e90027e55de9125'
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path = os.path.join(sen_trans_pretrained_path, "tokenizer",tail))
   

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path = os.path.join(sen_trans_pretrained_path, "model", tail))

pretrained_model_name_or_path 为之前缓存的模型和tokenizer的最里层目录
最里层目录指的是:

  • 对于tokenizer 要到词表和config这一层:在这里插入图片描述
  • 对于model 要到bin这一层:在这里插入图片描述

另一个例子

缓存
from sentence_transformers import SentenceTransformer
sen2vec = SentenceTransformer('paraphrase-MiniLM-L6-v2',cache_folder=os.getcwd()+"/flaskr/pretrained_ST")


在这里插入图片描述

本地调用
from sentence_transformers import SentenceTransformer
sen2vec = SentenceTransformer(os.getcwd()+"/flaskr/pretrained_ST/sentence-transformers_paraphrase-MiniLM-L6-v2")

参考

AutoTokenizer
AutoModel
SSLError: HTTPSConnectionPool(host=‘huggingface.co’, port=443)
class AutoTokenizer: 源码

 @classmethod
 def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs):
     r""" Instantiate one of the tokenizer classes of the library
     from a pre-trained model vocabulary.

     The tokenizer class to instantiate is selected
     based on the `model_type` property of the config object, or when it's missing,
     falling back to using pattern matching on the `pretrained_model_name_or_path` string:

         - `t5`: T5Tokenizer (T5 model)
         - `distilbert`: DistilBertTokenizer (DistilBert model)
         - `albert`: AlbertTokenizer (ALBERT model)
         - `camembert`: CamembertTokenizer (CamemBERT model)
         - `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)
         - `longformer`: LongformerTokenizer (AllenAI Longformer model)
         - `roberta`: RobertaTokenizer (RoBERTa model)
         - `bert-base-japanese`: BertJapaneseTokenizer (Bert model)
         - `bert`: BertTokenizer (Bert model)
         - `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
         - `gpt2`: GPT2Tokenizer (OpenAI GPT-2 model)
         - `transfo-xl`: TransfoXLTokenizer (Transformer-XL model)
         - `xlnet`: XLNetTokenizer (XLNet model)
         - `xlm`: XLMTokenizer (XLM model)
         - `ctrl`: CTRLTokenizer (Salesforce CTRL model)
         - `electra`: ElectraTokenizer (Google ELECTRA model)

     Params:
         pretrained_model_name_or_path: either:

             - a string with the `shortcut name` of a predefined tokenizer to load from cache or download, e.g.: ``bert-base-uncased``.
             - a string with the `identifier name` of a predefined tokenizer that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.
             - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
             - (not applicable to all derived classes) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.

         cache_dir: (`optional`) string:
             Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.

         force_download: (`optional`) boolean, default False:
             Force to (re-)download the vocabulary files and override the cached versions if they exists.

         resume_download: (`optional`) boolean, default False:
             Do not delete incompletely recieved file. Attempt to resume the download if such a file exists.

         proxies: (`optional`) dict, default None:
             A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
             The proxies are used on each request.

         use_fast: (`optional`) boolean, default False:
             Indicate if transformers should try to load the fast version of the tokenizer (True) or use the Python one (False).

         inputs: (`optional`) positional arguments: will be passed to the Tokenizer ``__init__`` method.

         kwargs: (`optional`) keyword arguments: will be passed to the Tokenizer ``__init__`` method. Can be used to set special tokens like ``bos_token``, ``eos_token``, ``unk_token``, ``sep_token``, ``pad_token``, ``cls_token``, ``mask_token``, ``additional_special_tokens``. See parameters in the doc string of :class:`~transformers.PreTrainedTokenizer` for details.

     Examples::

         # Download vocabulary from S3 and cache.
         tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

         # Download vocabulary from S3 (user-uploaded) and cache.
         tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-german-cased')

         # If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)
         tokenizer = AutoTokenizer.from_pretrained('./test/bert_saved_model/')

import os import torch import transformers from transformers import ( AutoModelForCausalLM, AutoTokenizer, TrainingArguments, DataCollatorForLanguageModeling, BitsAndBytesConfig, Trainer ) from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training from datasets import load_dataset import logging import psutil import gc from datetime import datetime # === 配置区域 === MODEL_NAME = "/home/vipuser/ai_writer_project_final_with_fixed_output_ui/models/Yi-6B" DATASET_PATH = "./data/train_lora_formatted.jsonl" OUTPUT_DIR = "./yi6b-lora-optimized" DEVICE_MAP = "auto" # 使用自动设备映射 # 确保输出目录存在 os.makedirs(OUTPUT_DIR, exist_ok=True) # === 内存优化配置 === os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True" # 减少内存碎片 torch.backends.cuda.cufft_plan_cache.clear() # 清理CUDA缓存 # === 增强的日志系统 === def setup_logging(output_dir): """配置日志系统,支持文件和TensorBoard""" logger = logging.getLogger(__name__) logger.setLevel(logging.INFO) # 文件日志处理器 file_handler = logging.FileHandler(os.path.join(output_dir, "training.log")) file_handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')) logger.addHandler(file_handler) # 控制台日志处理器 console_handler = logging.StreamHandler() console_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')) logger.addHandler(console_handler) # TensorBoard日志目录 tensorboard_log_dir = os.path.join(output_dir, "logs", datetime.now().strftime("%Y%m%d-%H%M%S")) os.makedirs(tensorboard_log_dir, exist_ok=True) # 安装TensorBoard回调 tb_writer = None try: from torch.utils.tensorboard import SummaryWriter tb_writer = SummaryWriter(log_dir=tensorboard_log_dir) logger.info(f"TensorBoard日志目录: {tensorboard_log_dir}") except ImportError: logger.warning("TensorBoard未安装,可视化功能不可用") return logger, tb_writer logger, tb_writer = setup_logging(OUTPUT_DIR) # === 量化配置 - 使用更高效的配置 === quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True, ) # === 加载模型 === logger.info("加载预训练模型...") model = AutoModelForCausalLM.from_pretrained( MODEL_NAME, device_map=DEVICE_MAP, quantization_config=quant_config, torch_dtype=torch.bfloat16, trust_remote_code=True, attn_implementation="flash_attention_2" # 使用FlashAttention优化内存 ) # === 分词器处理 === logger.info("加载分词器...") tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True) tokenizer.padding_side = "right" if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token tokenizer.pad_token_id = tokenizer.eos_token_id # === 准备模型训练 === model = prepare_model_for_kbit_training( model, use_gradient_checkpointing=True # 启用梯度检查点以节省内存 ) # === LoRA 配置 - 优化内存使用 === logger.info("配置LoRA...") lora_config = LoraConfig( r=64, # 降低rank以减少内存使用 lora_alpha=32, # 降低alpha值 target_modules=["q_proj", "v_proj"], # 减少目标模块 lora_dropout=0.05, bias="none", task_type="CAUSAL_LM" ) model = get_peft_model(model, lora_config) # 记录可训练参数 trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) total_params = sum(p.numel() for p in model.parameters()) logger.info(f"可训练参数: {trainable_params:,} / 总参数: {total_params:,} ({trainable_params/total_params:.2%})") # === 加载并预处理数据集 === logger.info("加载和预处理数据集...") dataset = load_dataset("json", data_files=DATASET_PATH, split="train") # 文本过滤函数 def is_valid_text(example): text = example.get("text", "") return text is not None and len(text.strip()) > 200 # 增加最小长度要求 dataset = dataset.filter(is_valid_text) logger.info(f"过滤后数据集大小: {len(dataset)} 条") # 动态填充的分词函数 - 节省内存 def tokenize_function(examples): tokenized = tokenizer( examples["text"], padding=True, # 使用动态填充 truncation=True, max_length=1024, # 降低上下文长度以减少内存使用 ) # 创建 labels - 因果语言建模需要 labels = input_ids tokenized["labels"] = tokenized["input_ids"].copy() return tokenized tokenized_dataset = dataset.map( tokenize_function, batched=True, remove_columns=["text"], batch_size=64, # 降低批处理大小以减少内存峰值 num_proc=4, # 减少进程数以降低内存开销 ) # === 数据整理器 === data_collator = DataCollatorForLanguageModeling( tokenizer=tokenizer, mlm=False # 因果语言建模 ) # === 训练参数 - 优化内存使用 === report_to_list = ["tensorboard"] if tb_writer else [] training_args = TrainingArguments( output_dir=OUTPUT_DIR, per_device_train_batch_size=4, # 大幅降低批次大小 gradient_accumulation_steps=4, # 增加梯度累积步数以保持有效批次大小 learning_rate=2e-5, num_train_epochs=3, logging_steps=50, save_strategy="steps", save_steps=500, bf16=True, optim="paged_adamw_32bit", report_to=report_to_list, warmup_ratio=0.05, gradient_checkpointing=True, # 启用梯度检查点 fp16=False, max_grad_norm=0.3, # 降低梯度裁剪阈值 remove_unused_columns=True, # 移除未使用的列以节省内存 dataloader_num_workers=4, # 减少数据加载工作线程 evaluation_strategy="steps", eval_steps=500, save_total_limit=2, # 减少保存的检查点数量 logging_dir=os.path.join(OUTPUT_DIR, "logs"), load_best_model_at_end=True, ddp_find_unused_parameters=False, logging_first_step=True, group_by_length=True, lr_scheduler_type="cosine", weight_decay=0.01, ) # === GPU监控工具 === def monitor_gpu(): """监控GPU使用情况""" if torch.cuda.is_available(): device = torch.device("cuda") mem_alloc = torch.cuda.memory_allocated(device) / 1024**3 mem_reserved = torch.cuda.memory_reserved(device) / 1024**3 mem_total = torch.cuda.get_device_properties(device).total_memory / 1024**3 return { "allocated": f"{mem_alloc:.2f} GB", "reserved": f"{mem_reserved:.2f} GB", "total": f"{mem_total:.2f} GB", "utilization": f"{mem_alloc/mem_total*100:.1f}%" } return {} # === 创建训练器 === eval_dataset = None if len(tokenized_dataset) > 100: eval_dataset = tokenized_dataset.select(range(100)) trainer = Trainer( model=model, tokenizer=tokenizer, args=training_args, train_dataset=tokenized_dataset, eval_dataset=eval_dataset, data_collator=data_collator, ) # === 训练前验证 === def validate_data_and_model(): """验证数据和模型是否准备好训练""" logger.info("\n=== 训练前验证 ===") # 检查样本格式 sample = tokenized_dataset[0] logger.info(f"样本键: {list(sample.keys())}") logger.info(f"input_ids 长度: {len(sample['input_ids'])}") # 创建单个样本测试批次 test_batch = data_collator([sample]) # 移动数据到设备 test_batch = {k: v.to(model.device) for k, v in test_batch.items()} # 前向传播测试 model.train() outputs = model(**test_batch) loss_value = outputs.loss.item() logger.info(f"测试批次损失: {loss_value:.4f}") # 记录到TensorBoard if tb_writer: tb_writer.add_scalar("debug/test_loss", loss_value, 0) # 反向传播测试 outputs.loss.backward() logger.info("反向传播成功!") # 重置梯度 model.zero_grad() logger.info("验证完成,准备开始训练\n") # 记录初始GPU使用情况 gpu_status = monitor_gpu() logger.info(f"初始GPU状态: {gpu_status}") # 记录到TensorBoard if tb_writer: tb_writer.add_text("system/initial_gpu", str(gpu_status), 0) validate_data_and_model() # === 自定义回调 - 监控资源使用 === class ResourceMonitorCallback(transformers.TrainerCallback): def __init__(self, tb_writer=None): self.tb_writer = tb_writer self.start_time = datetime.now() self.last_log_time = datetime.now() def on_step_end(self, args, state, control, **kwargs): current_time = datetime.now() time_diff = (current_time - self.last_log_time).total_seconds() # 每分钟记录一次资源使用情况 if time_diff > 60: self.last_log_time = current_time # GPU监控 gpu_status = monitor_gpu() logger.info(f"Step {state.global_step} - GPU状态: {gpu_status}") # CPU和内存监控 cpu_percent = psutil.cpu_percent() mem = psutil.virtual_memory() logger.info(f"CPU使用率: {cpu_percent}%, 内存使用: {mem.used/1024**3:.2f}GB/{mem.total/1024**3:.2f}GB") # 记录到TensorBoard if self.tb_writer: # GPU显存使用 if torch.cuda.is_available(): device = torch.device("cuda") mem_alloc = torch.cuda.memory_allocated(device) / 1024**3 self.tb_writer.add_scalar("system/gpu_mem", mem_alloc, state.global_step) # CPU使用率 self.tb_writer.add_scalar("system/cpu_usage", cpu_percent, state.global_step) # 系统内存使用 self.tb_writer.add_scalar("system/ram_usage", mem.used/1024**3, state.global_step) def on_log(self, args, state, control, logs=None, **kwargs): """记录训练指标到TensorBoard""" if self.tb_writer and logs is not None: for metric_name, metric_value in logs.items(): if "loss" in metric_name or "lr" in metric_name or "grad_norm" in metric_name: self.tb_writer.add_scalar(f"train/{metric_name}", metric_value, state.global_step) def on_train_end(self, args, state, control, **kwargs): """训练结束记录总间""" training_time = datetime.now() - self.start_time logger.info(f"训练总间: {training_time}") if self.tb_writer: self.tb_writer.add_text("system/total_time", str(training_time)) # 添加回调 trainer.add_callback(ResourceMonitorCallback(tb_writer=tb_writer)) # === 内存清理函数 === def clear_memory(): """清理内存和GPU缓存""" gc.collect() if torch.cuda.is_available(): torch.cuda.empty_cache() torch.cuda.ipc_collect() logger.info("内存清理完成") # === 启动训练 === try: logger.info("开始训练...") # 分阶段训练以减少内存峰值 num_samples = len(tokenized_dataset) chunk_size = 1000 # 每次处理1000个样本 for i in range(0, num_samples, chunk_size): end_idx = min(i + chunk_size, num_samples) logger.info(f"训练样本 {i} 到 {end_idx-1} / {num_samples}") # 创建子数据集 chunk_dataset = tokenized_dataset.select(range(i, end_idx)) # 更新训练器 trainer.train_dataset = chunk_dataset # 训练当前块 trainer.train() # 清理内存 clear_memory() # 保存训练指标 metrics = trainer.evaluate() trainer.log_metrics("train", metrics) trainer.save_metrics("train", metrics) # 保存最佳模型 trainer.save_model(OUTPUT_DIR) tokenizer.save_pretrained(OUTPUT_DIR) logger.info(f"训练完成! 模型保存在: {OUTPUT_DIR}") # 记录最终指标到TensorBoard if tb_writer: for metric_name, metric_value in metrics.items(): tb_writer.add_scalar(f"final/{metric_name}", metric_value) tb_writer.close() except Exception as e: logger.error(f"训练出错: {e}") import traceback logger.error(traceback.format_exc()) # 尝试更小批量训练 logger.info("\n尝试更小批量训练...") small_dataset = tokenized_dataset.select(range(50)) trainer.train_dataset = small_dataset trainer.train() # 保存模型 trainer.save_model(f"{OUTPUT_DIR}_small") tokenizer.save_pretrained(f"{OUTPUT_DIR}_small") logger.info(f"小批量训练完成! 模型保存在: {OUTPUT_DIR}_small") # 记录错误到TensorBoard if tb_writer: tb_writer.add_text("error/exception", traceback.format_exc()) # 清理内存 clear_memory() # === 训练后验证 === def validate_final_model(): """验证训练后的模型""" logger.info("\n=== 训练后验证 ===") # 加载保存的模型 from peft import PeftModel # 仅加载基础模型配置 base_model = AutoModelForCausalLM.from_pretrained( MODEL_NAME, device_map=DEVICE_MAP, quantization_config=quant_config, torch_dtype=torch.bfloat16, trust_remote_code=True, load_in_4bit=True ) # 加载LoRA适配器 peft_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR) # 合并LoRA权重 merged_model = peft_model.merge_and_unload() # 测试生成 prompt = "中国的首都是" inputs = tokenizer(prompt, return_tensors="pt").to(merged_model.device) outputs = merged_model.generate( **inputs, max_new_tokens=50, # 减少生成长度 temperature=0.7, top_p=0.9, repetition_penalty=1.2, do_sample=True ) generated = tokenizer.decode(outputs[0], skip_special_tokens=True) logger.info(f"提示: {prompt}") logger.info(f"生成结果: {generated}") # 记录到TensorBoard if tb_writer: tb_writer.add_text("validation/sample", f"提示: {prompt}\n生成: {generated}") # 更全面的测试 test_prompts = [ "人工智能的未来发展趋势是", "如何学习深度学习?", "写一个关于太空探索的短故事:" ] for i, test_prompt in enumerate(test_prompts): inputs = tokenizer(test_prompt, return_tensors="pt").to(merged_model.device) outputs = merged_model.generate( **inputs, max_new_tokens=100, # 减少生成长度 temperature=0.7, top_p=0.9, repetition_penalty=1.2, do_sample=True ) generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True) logger.info(f"\n提示: {test_prompt}\n生成: {generated_text}\n{'='*50}") # 记录到TensorBoard if tb_writer: tb_writer.add_text(f"validation/test_{i}", f"提示: {test_prompt}\n生成: {generated_text}") logger.info("验证完成") # 执行验证 validate_final_model() # 关闭TensorBoard写入器 if tb_writer: tb_writer.close() logger.info("TensorBoard日志已关闭") (.venv) (base) vipuser@ubuntu22:~/ai_writer_project_final_with_fixed_output_ui$ python train_lora.py 2025-07-13 22:10:19,098 - INFO - TensorBoard日志目录: ./yi6b-lora-optimized/logs/20250713-221019 2025-07-13 22:10:19,099 - INFO - 加载预训练模型... Traceback (most recent call last): File "/home/vipuser/ai_writer_project_final_with_fixed_output_ui/train_lora.py", line 77, in <module> model = AutoModelForCausalLM.from_pretrained( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/vipuser/ai_writer_project_final_with_fixed_output_ui/.venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained return model_class.from_pretrained( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/vipuser/ai_writer_project_final_with_fixed_output_ui/.venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3590, in from_pretrained config = cls._autoset_attn_implementation( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/vipuser/ai_writer_project_final_with_fixed_output_ui/.venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 1389, in _autoset_attn_implementation cls._check_and_enable_flash_attn_2( File "/home/vipuser/ai_writer_project_final_with_fixed_output_ui/.venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 1480, in _check_and_enable_flash_attn_2 raise ImportError(f"{preface} the package flash_attn seems to be not installed. {install_message}") ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
最新发布
07-14
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值