[Feature] Load weights from distributed #470

Edenzzzz · 2025-06-03T20:20:07Z

NVMe SSD read speeds are only a few GB/s, far slower than the 800 GB/s available over NVLink. This PR supports loading weights on one rank and broadcasting them to the other ranks on the node.
After this PR, text encoders are offloaded to CPU by default using FSDP (with layer-wise prefetch), because the encoder tests simply OOM on A40 without this.
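
To make the load-once-then-broadcast idea concrete, here is a minimal sketch using `torch.distributed`. It is not the code in this PR: `broadcast_weights`, `meta`, and `demo_single_process` are hypothetical names, and the sketch assumes every rank already knows each tensor's shape and dtype (for example from a checkpoint index). It runs on the CPU-only `gloo` backend so it can be tried in a single process; with NCCL the buffers would need to live on each rank's GPU. Whether `async_op` corresponds to the "sync" and "async" rows reported in this PR is also an assumption.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def broadcast_weights(state_dict, meta, src=0, async_op=True):
    """Broadcast every tensor described by `meta` from rank `src` to all ranks.

    state_dict: name -> tensor; only needs to be populated on rank `src`.
    meta: name -> (shape, dtype); assumed to be known on every rank.
    """
    rank = dist.get_rank()
    received, pending = {}, []
    for name, (shape, dtype) in meta.items():
        if rank == src:
            # The source rank sends the tensor it read from disk.
            buf = state_dict[name].contiguous()
        else:
            # Other ranks allocate an empty buffer to receive into.
            buf = torch.empty(shape, dtype=dtype)
        work = dist.broadcast(buf, src=src, async_op=async_op)
        if async_op:
            pending.append(work)  # keep the handle, continue with the next tensor
        received[name] = buf
    for work in pending:
        work.wait()  # make sure every broadcast has finished before returning
    return received


def demo_single_process():
    """Run the helper with world_size=1 on the gloo backend (illustration only)."""
    store = os.path.join(tempfile.mkdtemp(), "store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store}", rank=0, world_size=1
    )
    try:
        weights = {
            "linear.weight": torch.arange(6, dtype=torch.float32).reshape(2, 3)
        }
        meta = {k: (v.shape, v.dtype) for k, v in weights.items()}
        return weights, broadcast_weights(weights, meta)
    finally:
        dist.destroy_process_group()
```

In a real multi-GPU launch each process would call `broadcast_weights` after its own `init_process_group`, with only the source rank reading the checkpoint from disk.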

Results on H100 with the 14B model (Wan-AI/Wan2.1-T2V-14B-Diffusers), running:
python examples/inference/basic/basic.py

Method	2 GPUs	4 GPUs
From disk	16 s	21 s
Broadcast (sync)	14 s	15 s
Broadcast (async)	11 s	14 s
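
For a quick read of the table above, the relative speedup over loading from disk can be computed directly from the reported figures. This is only arithmetic on the numbers in this PR, assuming they are weight-loading times in seconds; `load_times` and `speedup_vs_disk` are names made up for this sketch.

```python
# Times in seconds, copied from the table above (rows: method, columns: GPU count).
load_times = {
    "From disk": {2: 16, 4: 21},
    "Broadcast (sync)": {2: 14, 4: 15},
    "Broadcast (async)": {2: 11, 4: 14},
}


def speedup_vs_disk(method, num_gpus):
    """Return how many times faster `method` is than loading from disk."""
    return load_times["From disk"][num_gpus] / load_times[method][num_gpus]


for method in ("Broadcast (sync)", "Broadcast (async)"):
    for num_gpus in (2, 4):
        ratio = speedup_vs_disk(method, num_gpus)
        print(f"{method:18s} {num_gpus} GPUs: {ratio:.2f}x")
```

On these numbers, async broadcast comes out at roughly 1.45x on 2 GPUs and 1.50x on 4 GPUs relative to loading from disk.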

TODO

Attach performance numbers
Use scatter instead of broadcast (in a future PR)
Support multithreaded loading (see sgl-project/sglang#7277, "Support multi-thread model weight loading")

jzhang38 · 2025-06-11T03:02:35Z

How is the performance so far?

Edenzzzz · 2025-06-11T04:03:12Z

I need to offload the encoder first so that CI passes.

Edenzzzz

Let's merge this to avoid more conflicts from get_local_torch_device

BrianChen1129 · 2025-06-28T03:47:05Z

Why did the transformer test fail?

Edenzzzz · 2025-06-28T03:51:38Z

#557 (comment) it failed a while ago

BrianChen1129 · 2025-06-28T03:53:17Z

> #557 (comment) it failed a while ago

Okay, then just merge.

Edenzzzz and others added 2 commits June 3, 2025 15:18

fix

d68476d

pre-commit

9f61a8f

Edenzzzz had a problem deploying to runpod-runners June 3, 2025 20:27 — with GitHub Actions Failure

Edenzzzz and others added 2 commits June 8, 2025 19:44

fix

69c78a9

Merge branch 'main' into dist_load

b37d49b

Edenzzzz had a problem deploying to runpod-runners June 9, 2025 02:46 — with GitHub Actions Failure

Edenzzzz changed the title ~~Load weights from distributed~~ [Feature] Load weights from distributed Jun 9, 2025

trigger ci

af24ad8

Edenzzzz temporarily deployed to runpod-runners June 9, 2025 15:39 — with GitHub Actions Inactive

Edenzzzz had a problem deploying to runpod-runners June 9, 2025 15:39 — with GitHub Actions Failure

Edenzzzz had a problem deploying to runpod-runners June 9, 2025 15:59 — with GitHub Actions Failure

Edenzzzz temporarily deployed to runpod-runners June 9, 2025 15:59 — with GitHub Actions Inactive

Edenzzzz had a problem deploying to runpod-runners June 10, 2025 03:27 — with GitHub Actions Failure

Edenzzzz mentioned this pull request Jun 10, 2025

[Feature] Development Roadmap (V1 Training & Distillation & VSA Release) #468

Closed

25 tasks

is -> ==

3f8e18d

Edenzzzz temporarily deployed to runpod-runners June 10, 2025 16:41 — with GitHub Actions Inactive

Edenzzzz had a problem deploying to runpod-runners June 10, 2025 16:41 — with GitHub Actions Failure

Merge branch 'main' into dist_load

284a712

Edenzzzz temporarily deployed to runpod-runners June 11, 2025 00:02 — with GitHub Actions Inactive

Edenzzzz had a problem deploying to runpod-runners June 11, 2025 00:02 — with GitHub Actions Failure

add _fsdp_shard_conditions for t5

fa23271

Edenzzzz had a problem deploying to runpod-runners June 28, 2025 00:47 — with GitHub Actions Error

Edenzzzz added 2 commits June 28, 2025 00:47

fix

7f63b94

fix

959fbec

Edenzzzz temporarily deployed to runpod-runners June 28, 2025 00:49 — with GitHub Actions Inactive

Edenzzzz had a problem deploying to runpod-runners June 28, 2025 00:49 — with GitHub Actions Failure

Edenzzzz requested a review from SolitaryThinker June 28, 2025 01:27

Edenzzzz commented Jun 28, 2025

Edenzzzz requested a review from BrianChen1129 June 28, 2025 01:34

BrianChen1129 approved these changes Jun 28, 2025

Edenzzzz merged commit c5155b2 into main Jun 28, 2025
11 of 13 checks passed

Edenzzzz deleted the dist_load branch June 28, 2025 03:52

Edenzzzz mentioned this pull request Jun 29, 2025

Set encoder TP size to 1 by default #569

Merged

SolitaryThinker added a commit that referenced this pull request Jun 29, 2025

Revert "[Feature] Load weights from distributed (#470)"

27edc56

This reverts commit c5155b2.

SolitaryThinker mentioned this pull request Jun 29, 2025

[Revert] "[Feature] Load weights from distributed" #571

Merged

qimcis pushed a commit to qimcis/FastVideo that referenced this pull request Oct 30, 2025

[Feature] Load weights from distributed (hao-ai-lab#470)

fb0cc83

Edenzzzz mentioned this pull request Nov 8, 2025

Fix distributed weight loading in multi-node training #572

Open

2 tasks
