Why a TensorFlow program hangs: the process neither reports an error nor makes progress

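When evaluating a trained model on the test set, the code raised no error but stalled at session.run(). The cause was that the TensorFlow data threads had not been started, so the dataflow graph could not be evaluated. Because TensorFlow's computation and data input are asynchronous, it keeps waiting when no data is available. There are two fixes: start the threads and synchronize them yourself, or use Supervisor to simplify this.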

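While testing a trained model on the test set, I found that my code ran without errors but froze at session.run(). After a long search I finally worked out that the problem was that the threads had never been started.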

The tf data threads had not been started, so the dataflow graph could not be evaluated and the whole program stalled there.

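The deeper reason is that TensorFlow's computation and data input are asynchronous. The sensible arrangement is for the main thread to train the model while a separate data-input thread reads data asynchronously. TensorFlow maintains a queue in memory, and the data thread asynchronously pushes samples from disk into it. Because training and reading are decoupled in this way, TensorFlow cannot raise an error when no data has arrived yet: data might still enter the queue later, so it simply keeps waiting.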
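
The blocking behaviour described above can be reproduced without TensorFlow. The following is a minimal sketch using only Python's standard `queue` and `threading` modules. It is an analogy, not TensorFlow's implementation, and the names `producer` and `q` are illustrative. The timeout is there only so the wait is observable; a plain `q.get()` on an empty queue with no producer would wait forever, which is what the hang looks like.

```python
import queue
import threading


def producer(q, items):
    """Stand-in for a data-reading thread: pushes samples into the queue."""
    for item in items:
        q.put(item)


q = queue.Queue(maxsize=4)

# No producer thread has been started, so the queue stays empty.
# A plain q.get() would block forever; the timeout makes the wait visible.
try:
    q.get(timeout=0.2)
    hung = False
except queue.Empty:
    hung = True
print("consumer would hang:", hung)

# Once the producer thread is running, the same read returns promptly.
t = threading.Thread(target=producer, args=(q, [1, 2, 3]))
t.start()
first = q.get(timeout=1.0)
t.join()
print("first sample:", first)
```

The consumer has no way to distinguish "no data yet" from "no data ever", which is why an empty queue produces a silent wait rather than an error.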

The following example, taken from someone else's code, demonstrates this:

Failure case:

Inside the Session, the data-input threads are never started. sess.run(train_input.input_data) therefore has no data to fetch, and the program hangs.

#-*- coding:utf-8 -*-
import numpy as np
import tensorflow as tf
 
from tensorflow.models.rnn.ptb import reader
 
class PTBInput(object):
  """The input data."""
  def __init__(self, config, data, name=None):
    self.batch_size = batch_size = config.batch_size
    self.num_steps = num_steps = config.num_steps
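    # Why subtract 1? The targets are the inputs shifted by one token,
    # so one token per row is reserved for the final target.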
    self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
    self.input_data, self.targets = reader.ptb_producer(
        data, batch_size, num_steps, name=name)
 
class SmallConfig(object):
  """Small config."""
  init_scale = 0.1
  learning_rate = 1.0
  max_grad_norm = 5
  num_layers = 2
  num_steps = 20
  hidden_size = 200
  max_epoch = 4
  max_max_epoch = 13
  keep_prob = 1.0
  lr_decay = 0.5
  batch_size = 20
  vocab_size = 10000
 
if __name__ == '__main__':
  config = SmallConfig()
  data_path = '/home/jdlu/jdluTensor/data/simple-examples/data'
  raw_data = reader.ptb_raw_data(data_path)
  train_data, valid_data, test_data, _ = raw_data
  train_input = PTBInput(config=config, data=train_data, name="TrainInput")
  print("end--------------------------------")

  # Wrong: with a plain Session and no queue runners started, no data can be
  # read, the dataflow graph cannot be evaluated, and the program hangs.
  with tf.Session() as sess:
    for step in range(1):
      print(sess.run(train_input.input_data))

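Solution:

There are two approaches:

  1. tf.train.range_input_producer(epoch_size, shuffle=False), which reader.ptb_producer calls internally, adds a QueueRunner to the default graph. We must call tf.train.start_queue_runners(sess=sess) to start its threads, and use coord = tf.train.Coordinator() to handle thread synchronization.
  2. The second approach is simpler: use sv = tf.train.Supervisor(). According to the documentation, "The Supervisor is a small wrapper around a Coordinator, a Saver, and a SessionManager".
    In other words, with Supervisor() we no longer have to manage model saving or thread synchronization ourselves.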
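
To make the coordinator's role in the first approach concrete, here is a stdlib-only sketch of the same start / request-stop / join pattern. `MiniCoordinator` and `enqueue_loop` are hypothetical names invented for this illustration. The method names mirror those of tf.train.Coordinator, but the class is a simplified analogue and not TensorFlow's code.

```python
import queue
import threading


class MiniCoordinator:
    """Hypothetical, simplified analogue of a thread coordinator."""

    def __init__(self):
        self._stop = threading.Event()

    def should_stop(self):
        return self._stop.is_set()

    def request_stop(self):
        self._stop.set()

    def join(self, threads):
        for t in threads:
            t.join()


def enqueue_loop(coord, q):
    """Keep feeding the queue until a stop is requested."""
    n = 0
    while not coord.should_stop():
        try:
            # Short timeout so the stop flag is re-checked regularly.
            q.put(n, timeout=0.05)
            n += 1
        except queue.Full:
            pass


q = queue.Queue(maxsize=2)
coord = MiniCoordinator()
threads = [threading.Thread(target=enqueue_loop, args=(coord, q))]
for t in threads:
    t.start()  # without this step the reads below would find no data

try:
    batch = [q.get(timeout=1.0) for _ in range(3)]
finally:
    # Always ask the workers to stop and wait for them, even on error.
    coord.request_stop()
    coord.join(threads)

print(batch)
```

The `try`/`finally` shape is a design choice for this sketch: it guarantees the worker threads are stopped and joined whether or not the consuming code raises.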
     
#-*- coding:utf-8 -*-
import numpy as np
import tensorflow as tf
 
from tensorflow.models.rnn.ptb import reader
 
class PTBInput(object):
  """The input data."""
  def __init__(self, config, data, name=None):
    self.batch_size = batch_size = config.batch_size
    self.num_steps = num_steps = config.num_steps
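    # Why subtract 1? The targets are the inputs shifted by one token,
    # so one token per row is reserved for the final target.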
    self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
    self.input_data, self.targets = reader.ptb_producer(
        data, batch_size, num_steps, name=name)
 
class SmallConfig(object):
  """Small config."""
  init_scale = 0.1
  learning_rate = 1.0
  max_grad_norm = 5
  num_layers = 2
  num_steps = 20
  hidden_size = 200
  max_epoch = 4
  max_max_epoch = 13
  keep_prob = 1.0
  lr_decay = 0.5
  batch_size = 20
  vocab_size = 10000
 
if __name__ == '__main__':
  config = SmallConfig()
  data_path = '/home/jdlu/jdluTensor/data/simple-examples/data'
  raw_data = reader.ptb_raw_data(data_path)
  train_data, valid_data, test_data, _ = raw_data
  train_input = PTBInput(config=config, data=train_data, name="TrainInput")
  print("end--------------------------------")

  # Correct, method 2: use Supervisor().
  # sv = tf.train.Supervisor()
  # with sv.managed_session() as sess:
  #   for step in range(1):
  #     print(sess.run(train_input.input_data))

  # Correct, method 1: start the queue runners explicitly.
  # Create a session for running operations in the Graph.
  sess = tf.Session()
  # Start input enqueue threads.
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(sess=sess, coord=coord)
  # Run training steps or whatever.
  try:
    for step in range(2):
      print(sess.run(train_input.input_data))
  except Exception as e:
    # Report exceptions to the coordinator.
    coord.request_stop(e)
  coord.request_stop()
  # Terminate as usual.  It is innocuous to request stop twice.
  coord.join(threads)
  sess.close()
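
As a side note on the `- 1` in `epoch_size`, which a comment in the listing asks about: the arithmetic can be checked in plain Python. This is a sketch under the assumption that each target window is the input window shifted by one token. The helpers `epoch_size` and `windows` are written for this illustration and are not part of the `reader` module.

```python
def epoch_size(data_len, batch_size, num_steps):
    """Number of full (input, target) windows available per batch row."""
    batch_len = data_len // batch_size
    return (batch_len - 1) // num_steps


def windows(row, num_steps):
    """Yield (inputs, targets) pairs; targets are inputs shifted by one."""
    for i in range((len(row) - 1) // num_steps):
        start = i * num_steps
        inputs = row[start:start + num_steps]
        targets = row[start + 1:start + num_steps + 1]
        yield inputs, targets


# 24 tokens split into 2 rows of 12 tokens each, read 4 steps at a time.
row = list(range(12))
pairs = list(windows(row, num_steps=4))
for inputs, targets in pairs:
    print(inputs, targets)
print(epoch_size(24, 2, 4))
```

Without the `- 1`, a row of 12 tokens would appear to hold three windows of 4, but the third window's targets would need a thirteenth token that does not exist.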

 
