KEP-30: add role coordination kep #59

gujingit · 2025-10-17T03:56:19Z

Ⅰ. Motivation

Ⅱ. Modifications

Add Role coordination KEP

Ⅲ. Does this pull request fix one issue?

NONE

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

Format your code with `make fmt`.
Add unit tests or integration tests.
Update the documentation related to the change.

codecov-commenter · 2025-10-17T04:01:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

cheyang · 2025-10-18T06:12:25Z

keps/30-role-coordination/kep.yaml

+title: KEP Template
+kep-number: NNNN
+authors:
+  - "@jane.doe"


Please update the information of this file.

cheyang · 2025-10-19T08:08:12Z

keps/30-role-coordination/README.md

+Each coordination step will:
+
+1. Apply the specified strategies to the relevant roles
+2. Monitor the status of those roles


I suggest adding a detailed design for how the router component is instructed to split requests between old and new Pods in real time during a coordinated RoleBasedGroup upgrade, so that it can integrate with the SGLang Router's rolling update workflow.

ZYecho11

I think this coordinated upgrade strategy should be treated as an enhancement of and supplement to the original upgrade strategy, not a replacement for it. For example, if a Role's original maxUnavailable is 5 but the user wants to configure a 4:1 coordinated upgrade ratio, do they need to change the original maxUnavailable to 4?

gujingit · 2025-10-24T02:27:39Z

The API design has been updated: maxUnavailable and maxSurge are removed, and batchSize now represents the update ratio between roles.
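To make the batchSize idea concrete, here is a minimal sketch of how per-role batch sizes could translate into coordinated steps. The `RoleUpdate` type and `PlanSteps` function are hypothetical names used only for illustration, and the sketch assumes StatefulSet-like partition semantics (the number of pods that stay on the old revision); the KEP itself may define these differently.

```go
package main

import "fmt"

// RoleUpdate is a hypothetical, simplified view of one role's update settings.
type RoleUpdate struct {
	Name      string
	Replicas  int // total pods in the role
	Partition int // pods kept on the old revision (assumed StatefulSet-like semantics)
	BatchSize int // pods updated per coordination step
}

// PlanSteps returns, for each step, the cumulative number of updated pods per role.
// Every role advances by its own BatchSize within the same step, so the roles
// progress at the BatchSize ratio until each reaches Replicas - Partition.
func PlanSteps(roles []RoleUpdate) [][]int {
	var steps [][]int
	updated := make([]int, len(roles))
	for {
		progressed := false
		for i, r := range roles {
			target := r.Replicas - r.Partition
			if r.BatchSize <= 0 || updated[i] >= target {
				continue
			}
			next := updated[i] + r.BatchSize
			if next > target {
				next = target
			}
			updated[i] = next
			progressed = true
		}
		if !progressed {
			return steps
		}
		// Copy the slice so later steps do not overwrite this snapshot.
		steps = append(steps, append([]int(nil), updated...))
	}
}

func main() {
	steps := PlanSteps([]RoleUpdate{
		{Name: "prefill", Replicas: 40, Partition: 20, BatchSize: 4},
		{Name: "decode", Replicas: 10, Partition: 5, BatchSize: 1},
	})
	for i, s := range steps {
		fmt.Printf("step %d: prefill=%d decode=%d\n", i+1, s[0], s[1])
	}
}
```

With the 40 prefill / 10 decode example from the KEP and partitions of 20 and 5, this sketch stops after five steps with 20 prefill and 5 decode pods updated.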

Copilot

Pull Request Overview

This PR introduces KEP-30 (Role Coordination for RoleBasedGroup), which proposes a coordination mechanism for managing updates across multiple roles in a RoleBasedGroup. The enhancement enables phased rollouts where roles are updated in a coordinated sequence while maintaining specific ratios (e.g., 2:1 prefill-to-decode for LLM inference workloads).

Key Changes:

- New coordination API to define cross-role update strategies with configurable batch sizes and partitions
- Controller logic to execute coordinated updates based on defined strategies
- Support for maintaining role ratios during rolling updates (e.g., 4P2D pattern for PD-disaggregated LLM inference)

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| keps/NNNN-kep-template/kep.yaml | Template KEP metadata file added |
| keps/NNNN-kep-template/README.md | Template KEP documentation added |
| keps/30-role-coordination/kep.yaml | KEP-30 metadata with incorrect kep-number (21 instead of 30) and title |
| keps/30-role-coordination/README.md | Complete KEP-30 specification including API design, user stories, and implementation details |

Copilot · 2025-10-26T15:40:17Z

keps/30-role-coordination/kep.yaml

@@ -0,0 +1,15 @@
+title: KEP Template
+kep-number: 21


The kep-number is set to 21 but the directory name and README title indicate this is KEP-30. This inconsistency should be corrected to 30.

Suggested change

-kep-number: 21
+kep-number: 30

Copilot · 2025-10-26T15:40:18Z

keps/NNNN-kep-template/README.md

+template.
+
+Ensure the TOC is wrapped with
+  <code>&lt;!-- toc --&rt;&lt;!-- /toc --&rt;</code>


Corrected the invalid HTML entity '&rt;' to '&gt;' in the TOC generation instruction.

Suggested change

-  <code>&lt;!-- toc --&rt;&lt;!-- /toc --&rt;</code>
+  <code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>

Copilot · 2025-10-26T15:40:18Z

keps/30-role-coordination/README.md

+type RoleBasedGroupSpec struct {
+	// Existing fields...
+
+	// Coordination defines how roles should be coordinated 


The comment is missing a period at the end. Go documentation comments should be complete sentences ending with punctuation.

Suggested change

-	// Coordination defines how roles should be coordinated
+	// Coordination defines how roles should be coordinated.

Copilot · 2025-10-26T15:40:18Z

keps/30-role-coordination/README.md

+        - role: decode
+          updateStrategy:
+            partition: 5
+            batchSize: 4


The batchSize for the decode role is set to 4, but according to the example description and the strategy in lines 211-220, decode should have batchSize: 1 to maintain the 4:1 ratio with prefill (batchSize: 4).

Suggested change

-            batchSize: 4
+            batchSize: 1

cheyang · 2025-10-26T16:11:26Z

keps/30-role-coordination/README.md

+40 prefill pods and 10 decode pods were upgraded in a rolling update with a 4:1 ratio of prefill to decode.
+After rolling updates of 20 prefill pods and 5 decode pods, the process was paused.
+```yaml
+apiVersion: workloads.x-k8s.io/v1alpha1


I recommend retaining support for multiple strategy types such as updateStrategy, scaleStrategy, and gangStrategy, while using a unified config structure internally implemented as a map. The API would be:

apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
spec:
  coordinations:
    - name: "prefill-decode"
      strategies:
        - role: prefill
          strategyType: updateStrategy  # Explicit strategy type
          config:                       # Universal map-based config
            partition: 20
            batchSize: 2
        - role: decode
          strategyType: updateStrategy
          config:
            partition: 5
            batchSize: 1
    - name: "router-planner"
      strategies:
        - role: router
          strategyType: gangStrategy    # Different strategy type
          config:                       # Same config structure
            minAvailable: 2
            timeout: 10m
        - role: planner
          strategyType: scaleStrategy   # Third strategy type
          config:
            minReplicas: 3
            maxReplicas: 10

apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: role-coordination
spec:
  coordination:
    - name: "prefill-decode-update"
      type: updateStrategy              # Explicit strategy type
      strategy:
        - role: prefill
          config:                       # Universal map-based config
            type: Recreate              # [Recreate, InplaceIfPossible], default is Recreate
            batchSize: 2
        - role: decode
          config:
            type: Recreate              # [Recreate, InplaceIfPossible], default is Recreate
            batchSize: 1
    # other coordination strategy type demo
    - name: "prefill-decode-scaling"    # Support later
      type: scalingStrategy             # Different strategy type
      strategies:
        - role: prefill
          config:                       # Same config structure
            batchSize: 2
        - role: decode
          config:
            batchSize: 1
    - name: "prefill-decode-gang"       # Other type will support on demand
      type: gangStrategy                # Different strategy type
      strategies:
        - role: prefill
          config:                       # Same config structure
            minAvailable: 2
            timeout: 10m
        - role: decode
          config:
            minAvailable: 2
            timeout: 10m

Key Implementation Features:

Unified Config Structure

config:          # Common structure for all strategies
  key1: value1
  key2: value2
  ...

Strategy Type Flexibility

type Strategy struct {
	Role         string
	StrategyType string // updateStrategy|scaleStrategy|gangStrategy
	Config       map[string]string
}

Validation Handling

func Validate(config map[string]string, strategyType string) error {
	switch strategyType {
	case "updateStrategy":
		return validateUpdateConfig(config)
	case "gangStrategy":
		return validateGangConfig(config)
	case "scaleStrategy":
		return validateScaleConfig(config)
	default:
		return fmt.Errorf("unknown strategy type: %s", strategyType)
	}
}

This maintains API consistency while enabling diverse strategy implementations through:

Single config field with flexible key-value pairs

Explicit strategyType designation for behavior dispatch

Backward compatibility with existing updateStrategy configurations

Clean extension path for future strategy types
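As an illustration of the dispatch described above, here is a self-contained sketch with one validator filled in. The body of `validateUpdateConfig` is an assumption made for this example (it only checks that `batchSize`, when present, is a positive integer); the comment does not specify what each validator checks, and the gang and scale validators are omitted here, so those types fall through to the default case in this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// validateUpdateConfig is a hypothetical validator: it only checks that
// batchSize, when present, parses as a positive integer.
func validateUpdateConfig(config map[string]string) error {
	raw, ok := config["batchSize"]
	if !ok {
		return nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return fmt.Errorf("batchSize %q is not an integer: %w", raw, err)
	}
	if n <= 0 {
		return fmt.Errorf("batchSize must be positive, got %d", n)
	}
	return nil
}

// Validate dispatches on strategyType. Only updateStrategy is wired up in
// this sketch; every other type is reported as unknown.
func Validate(config map[string]string, strategyType string) error {
	switch strategyType {
	case "updateStrategy":
		return validateUpdateConfig(config)
	default:
		return fmt.Errorf("unknown strategy type: %s", strategyType)
	}
}

func main() {
	fmt.Println(Validate(map[string]string{"batchSize": "2"}, "updateStrategy"))
	fmt.Println(Validate(map[string]string{"batchSize": "two"}, "updateStrategy"))
	fmt.Println(Validate(nil, "bogusStrategy"))
}
```

One consequence of a `map[string]string` config is that every numeric field has to be parsed and range-checked at validation time, since the API schema itself cannot enforce the types.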

Why is it necessary to set gangPolicy under the coordination field? What is the user story?

I need to deploy a minimal structure consisting of at least one router, two prefillers, and two decoders using RBG.

apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
spec:
  coordinations:
    - name: "minimal-deployment"
      strategies:
        - role: router
          strategyType: gangStrategy
          config:
            minAvailable: 1  # Minimum 1 router
        - role: prefill
          strategyType: gangStrategy
          config:
            minAvailable: 2  # Minimum 2 prefill
        - role: decode
          strategyType: gangStrategy
          config:
            minAvailable: 2  # Minimum 2 decode

Isn't this information already defined in the podgroup and podgroupPolicy? Why do we need to repeat the definition?

The difference is that coordination needs to support multiple atomic gangs. For example:

Role prefill requires 22 replicas

Role decode requires 10 replicas

But each atomic gang must contain exactly 2 prefill pods and 1 decode pod, placed together on the same node.

This necessitates coordinated sequential gangs – a single PodGroup cannot express this multi-batch requirement.
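The arithmetic behind this example can be sketched as follows. `CountGangs` is a hypothetical helper, not part of the proposed API; it only counts how many complete 2-prefill + 1-decode gangs fit into the stated replica counts. How the leftover pods are handled is not specified in this thread.

```go
package main

import "fmt"

// CountGangs returns how many complete atomic gangs can be formed and how
// many pods of each role remain. perGang maps a role name to the number of
// pods of that role in one gang.
func CountGangs(replicas, perGang map[string]int) (int, map[string]int) {
	gangs := -1
	for role, n := range perGang {
		if n <= 0 {
			continue // a role that contributes no pods does not limit the count
		}
		if c := replicas[role] / n; gangs < 0 || c < gangs {
			gangs = c
		}
	}
	if gangs < 0 {
		gangs = 0
	}
	leftover := make(map[string]int, len(replicas))
	for role, total := range replicas {
		per := perGang[role]
		if per < 0 {
			per = 0
		}
		leftover[role] = total - gangs*per
	}
	return gangs, leftover
}

func main() {
	gangs, leftover := CountGangs(
		map[string]int{"prefill": 22, "decode": 10},
		map[string]int{"prefill": 2, "decode": 1},
	)
	fmt.Printf("gangs=%d leftover prefill=%d decode=%d\n",
		gangs, leftover["prefill"], leftover["decode"])
}
```

For 22 prefill and 10 decode replicas, decode is the limiting role: ten complete gangs can be formed, leaving two prefill pods outside any gang.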

Thanks, got it.

veophi · 2025-10-31T06:35:15Z

keps/30-role-coordination/README.md

+  name: role-coordination
+spec:
+  coordination:
+    - name: "prefill-decode-update"  # strategy 1: reconcile prefill & decode at 4:1 ratio


How about

coordination:
  - name: prefill-decode-update
    type: RollingUpdate
    roles:
      - prefill
      - decode
    strategy:
      # The max skew of updated replicas between prefill and decode.
      # For example, one rbg with (200p, 100d) allows
      # `abs(updated_prefill/200 - updated_decode/100) < 1%`.
      maxSkew: 1%
      partition: 99%

This design binds the rolling update ratio to the replica counts of prefill and decode. If prefill or decode is scaled by an HPA, their replica counts are not stable, so the update ratio will not meet expectations.

This design also cannot support the following case: there are 40 prefill and 10 decode Pods, and the user wants to update them at prefill:decode = 2:1.

veophi · 2025-10-31T06:40:32Z

keps/30-role-coordination/README.md

+	// existing independent role update behavior
+}
+
+func (r *RoleBasedGroupReconciler) executeCoordinationStrategy() {


Behavior when a single Pod becomes unhealthy during rollout?

If one Pod enters an Unhealthy state (e.g., readiness/failure probe fails due to some node problems) during a rolling upgrade, will the rbg controller block the entire upgrade — even when maxUnavailable is explicitly set to >1?

The rbg controller's rolling update follows the Kubernetes StatefulSet design. If an old-version Pod becomes unhealthy, it does not block the upgrade. If a new-version Pod becomes unhealthy, even a single one, it blocks the upgrade.
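The rule stated in this reply can be sketched as a single predicate. The `Pod` type and `RolloutBlocked` function are hypothetical and deliberately simplified; the actual controller logic is not shown in this thread and will likely consider more conditions.

```go
package main

import "fmt"

// Pod is a hypothetical, minimal view of a pod during a rollout.
type Pod struct {
	Name    string
	Updated bool // true if the pod runs the new revision
	Ready   bool // true if the pod is healthy
}

// RolloutBlocked reports whether the rollout should pause: it blocks only
// when a pod on the new revision is unhealthy. Unhealthy pods on the old
// revision do not block, since they are going to be replaced anyway.
func RolloutBlocked(pods []Pod) bool {
	for _, p := range pods {
		if p.Updated && !p.Ready {
			return true
		}
	}
	return false
}

func main() {
	oldUnhealthy := []Pod{
		{Name: "decode-0", Updated: false, Ready: false},
		{Name: "decode-1", Updated: true, Ready: true},
	}
	newUnhealthy := []Pod{
		{Name: "decode-0", Updated: false, Ready: true},
		{Name: "decode-1", Updated: true, Ready: false},
	}
	fmt.Println(RolloutBlocked(oldUnhealthy)) // false
	fmt.Println(RolloutBlocked(newUnhealthy)) // true
}
```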

cheyang

/lgtm
/approve

add role coordination kep

d78c75d

gujingit force-pushed the feature/role-coordination branch from 4688446 to c845bb4 Compare October 17, 2025 08:52

cheyang reviewed Oct 18, 2025

cheyang requested changes Oct 19, 2025

update struct & yaml

470ee3c

gujingit force-pushed the feature/role-coordination branch from c845bb4 to 470ee3c Compare October 21, 2025 13:15

ZYecho11 reviewed Oct 22, 2025

add api design

2e432a9

cheyang self-requested a review October 24, 2025 02:51

gujingit mentioned this pull request Oct 24, 2025

[WIP] feat: support role cooridination #69

Closed

3 tasks

RongGu requested a review from Copilot October 26, 2025 15:39

Copilot AI reviewed Oct 26, 2025

cheyang reviewed Oct 26, 2025

Syspretor added 2 commits October 29, 2025 14:58

Update kep.yaml

5930abd

Refine API struct and naming

f119958

veophi reviewed Oct 31, 2025

Refine coordination api design

729ba3a

cheyang approved these changes Nov 5, 2025

cheyang merged commit 22e7ded into sgl-project:main Nov 5, 2025
3 checks passed

veophi mentioned this pull request Nov 5, 2025

api: add coordination api for rbg #84

Closed

3 tasks
