RunLoop Configuration in Practice: Efficient Management in Threads

"《Threading Programming Guide》笔记3:RunLoop操作配置实践" 本文是对苹果官方文档《Threading Programming Guide》的深入解读,聚焦于RunLoop在实际编程中的操作和配置。作者付宇轩在系列笔记的第三部分,分享了如何在自创线程中创建、配置和管理RunLoop。 在iOS和macOS开发中,RunLoop是线程管理的核心组件,特别是在主线程中,它是随应用程序启动而自动创建并运行的。由于现代Xcode项目模板会默认处理主线程的RunLoop,开发者通常无需关注其状态。然而,在自定义的子线程中,开发者则需要手动创建和配置RunLoop,以便更好地控制线程的行为和资源消耗。 RunLoop的主要功能是保持线程活跃但不消耗过多资源。是否需要在子线程中使用RunLoop,取决于线程的任务性质。一次性任务或长时间运行的任务可能不需要RunLoop,而周期性任务、需要频繁与主线程交互的任务,或者利用Cocoa框架的`performSelector…`系列方法时,则强烈推荐使用RunLoop。使用RunLoop的典型场景包括: 1. 基于端口或自定义数据源与其他线程通信。 2. 在线程上执行定时任务。 3. 使用Cocoa框架的定时器功能。 4. 执行周期性或频繁的任务。 为了操作RunLoop,我们需要首先获取到RunLoop对象。在Cocoa层,我们可以使用`NSRunLoop`类,而在CoreFoundation层,我们则需要操作`CFRunLoopRef`指针。尽管两种框架都没有提供直接创建RunLoop的方法,但我们可以通过以下方式获取线程的RunLoop: - `NSRunLoop`: 使用`currentRunLoop`方法获取当前线程的RunLoop。 - `CFRunLoopRef`: 调用`CFRunLoopGetCurrent()`函数获取当前线程的RunLoop引用。 一旦获取到RunLoop对象,就可以添加输入源(Inputsources)、定时器(Timers)和观察者(Observers)来定制RunLoop的行为。例如,输入源可以响应特定事件,定时器用于定期触发某些操作,而观察者则可以在RunLoop状态改变时执行回调。 配置RunLoop涉及到设置运行模式(Run Modes),不同的运行模式决定了哪些输入源和定时器会被激活。`NSRunLoop`提供了`commonModes`属性,包含了常用模式,如`NSDefaultRunLoopMode`,可以确保在大多数情况下,RunLoop都能正常工作。 理解并合理使用RunLoop对于优化iOS和macOS应用程序的性能和资源管理至关重要。开发者需要根据应用的需求,精确地创建、配置和调度RunLoop,以实现高效、低耗的多线程编程。通过深入学习《Threading Programming Guide》以及实践,可以进一步提升这方面的能力。
