-
Notifications
You must be signed in to change notification settings - Fork 37
KEP-30: add role coordination kep #59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
4688446 to
c845bb4
Compare
keps/30-role-coordination/kep.yaml
Outdated
| title: KEP Template | ||
| kep-number: NNNN | ||
| authors: | ||
| - "@jane.doe" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the information of this file.
keps/30-role-coordination/README.md
Outdated
| Each coordination step will: | ||
|
|
||
| 1. Apply the specified strategies to the relevant roles | ||
| 2. Monitor the status of those roles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest adding the detail design to instruct the router component to split requests between the old and new Pods in real time during a RoleBasedGroup coordinated upgrade, so that it can integrate with SGLang Router's rolling update workflow
c845bb4 to
470ee3c
Compare
ZYecho11
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this association upgrade strategy can be considered as an enhancement and supplement to the original upgrade strategy, rather than a replacement relationship. For example, if the user's original Role maxUnavailable is 5, but wants to configure a 4:1 association upgrade relationship, does it need to modify the original maxUnavailable to 4?
|
API design has updated. Remove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces KEP-30 (Role Coordination for RoleBasedGroup), which proposes a coordination mechanism for managing updates across multiple roles in a RoleBasedGroup. The enhancement enables phased rollouts where roles are updated in a coordinated sequence while maintaining specific ratios (e.g., 2:1 prefill-to-decode for LLM inference workloads).
Key Changes:
- New coordination API to define cross-role update strategies with configurable batch sizes and partitions
- Controller logic to execute coordinated updates based on defined strategies
- Support for maintaining role ratios during rolling updates (e.g., 4P2D pattern for PD-disaggregated LLM inference)
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| keps/NNNN-kep-template/kep.yaml | Template KEP metadata file added |
| keps/NNNN-kep-template/README.md | Template KEP documentation added |
| keps/30-role-coordination/kep.yaml | KEP-30 metadata with incorrect kep-number (21 instead of 30) and title |
| keps/30-role-coordination/README.md | Complete KEP-30 specification including API design, user stories, and implementation details |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
keps/30-role-coordination/kep.yaml
Outdated
| @@ -0,0 +1,15 @@ | |||
| title: KEP Template | |||
| kep-number: 21 | |||
Copilot
AI
Oct 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kep-number is set to 21 but the directory name and README title indicate this is KEP-30. This inconsistency should be corrected to 30.
| kep-number: 21 | |
| kep-number: 30 |
| template. | ||
|
|
||
| Ensure the TOC is wrapped with | ||
| <code><!-- toc --&rt;<!-- /toc --&rt;</code> |
Copilot
AI
Oct 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected HTML entity closing tags from '&rt;' to '>' in the TOC generation instruction.
| <code><!-- toc --&rt;<!-- /toc --&rt;</code> | |
| <code><!-- toc --><!-- /toc --></code> |
| type RoleBasedGroupSpec struct { | ||
| // Existing fields... | ||
|
|
||
| // Coordination defines how roles should be coordinated |
Copilot
AI
Oct 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is missing a period at the end. Go documentation comments should be complete sentences ending with punctuation.
| // Coordination defines how roles should be coordinated | |
| // Coordination defines how roles should be coordinated. |
keps/30-role-coordination/README.md
Outdated
| - role: decode | ||
| updateStrategy: | ||
| partition: 5 | ||
| batchSize: 4 |
Copilot
AI
Oct 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The batchSize for the decode role is set to 4, but according to the example description and the strategy in lines 211-220, decode should have batchSize: 1 to maintain the 4:1 ratio with prefill (batchSize: 4).
| batchSize: 4 | |
| batchSize: 1 |
| 40 prefill pods and 10 decode pods were upgraded in a rolling update with a 4:1 ratio of prefill to decode. | ||
| After rolling updates of 20 prefill pods and 5 decode pods, the process was paused. | ||
| ```yaml | ||
| apiVersion: workloads.x-k8s.io/v1alpha1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend retaining support for multiple strategy types such as updateStrategy, scaleStrategy, and gangStrategy, while using a unified config structure internally implemented as a map. The API would be:
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
spec:
coordinations:
- name: "prefill-decode"
strategies:
- role: prefill
strategyType: updateStrategy # Explicit strategy type
config: # Universal map-based config
partition: 20
batchSize: 2
- role: decode
strategyType: updateStrategy
config:
partition: 5
batchSize: 1
- name: "router-planner"
strategies:
- role: router
strategyType: gangStrategy # Different strategy type
config: # Same config structure
minAvailable: 2
timeout: 10m
- role: planner
strategyType: scaleStrategy # Third strategy type
config:
minReplicas: 3
maxReplicas: 10apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
name: role-coordination
spec:
coordination:
- name: "prefill-decode-update"
type: updateStrategy # Explicit strategy type
strategy:
- role: prefill
config: # Universal map-based config
type: Recreate # [Recreate, InplaceIfPossible], default is Recreate
batchSize: 2
- role: decode
config:
type: Recreate # [Recreate, InplaceIfPossible], default is Recreate
batchSize: 1
# other coordination strategy type demo
- name: "prefill-decode-scaling" # Support later
type: scalingStrategy # Different strategy type
strategies:
- role: prefill
config: # Same config structure
batchSize: 2
- role: decode
config:
batchSize: 1
- name: "prefill-decode-gang" # Other type will support on demand
type: gangStrategy # Different strategy type
strategies:
- role: prefill
config: # Same config structure
minAvailable: 2
timeout: 10m
- role: decode
config:
minAvailable: 2
timeout: 10mKey Implementation Features:
-
Unified Config Structure
config: # Common structure for all strategies key1: value1 key2: value2 ...
-
Strategy Type Flexibility
type Strategy struct { Role string StrategyType string // updateStrategy|scaleStrategy|gangStrategy Config map[string]string }
-
Validation Handling
func Validate(config map[string]string, strategyType string) error { switch strategyType { case "updateStrategy": return validateUpdateConfig(config) case "gangStrategy": return validateGangConfig(config) case "scaleStrategy": return validateScaleConfig(config) default: return fmt.Errorf("unknown strategy type: %s", strategyType) } }
This maintains API consistency while enabling diverse strategy implementations through:
- Single
configfield with flexible key-value pairs - Explicit
strategyTypedesignation for behavior dispatch - Backward compatibility with existing updateStrategy configurations
- Clean extension path for future strategy types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it necessary to set gangPolicy under the coordinates field? What is User Story?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to deploy a minimal structure consisting of at least one router, two prefiller, and two decoder instances using RBG.
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
spec:
coordinations:
- name: "minimal-deployment"
strategies:
- role: router
strategyType: gangStrategy
config:
minAvailable: 1 # Minimum 1 router
- role: prefill
strategyType: gangStrategy
config:
minAvailable: 2 # Minimum 2 prefill
- role: decode
strategyType: gangStrategy
config:
minAvailable: 2 # Minimum 2 decodeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this information already defined in the podgroup and podgroupPolicy? Why do we need to repeat the definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference is that coordination needs to support multiple atomic gangs. For example:
- Role
prefillrequires 22 replicas - Role
decoderequires 10 replicas - But each atomic gang must contain exactly 2
prefill+ 1decodepods each time to place in the same node.
This necessitates coordinated sequential gangs – a single PodGroup cannot express this multi-batch requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, get it
keps/30-role-coordination/README.md
Outdated
| name: role-coordination | ||
| spec: | ||
| coordination: | ||
| - name: "prefill-decode-update" # strategy 1: reconcile prefill & decode at 4:1 ratio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
coordination:
- name: prefill-decode-update
type: RollingUpdate
roles:
- prefill
- decode
strategy:
maxSkew: 1% # the max skew of updated replicas between prefill and decode. For example, one rbg with (200p, 100d) allows `abs(updated_prefill/200 - updated_decode/100) < 1%`.
partition: 99%There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This design will binds the rolling update ratio to the replicas of prefill and decode. If the prefill or decode has HPA , their replica count is not accurate, the update ratio will not meet expectations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this design can not support this case. There are 40 Prefill and 10 Decode pods and users want to update prefill:decode = 2:1.
keps/30-role-coordination/README.md
Outdated
| // existing independent role update behavior | ||
| } | ||
|
|
||
| func (r *RoleBasedGroupReconciler) executeCoordinationStrategy() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Behavior when a single Pod becomes unhealthy during rollout?
If one Pod enters an Unhealthy state (e.g., readiness/failure probe fails due to some node problems) during a rolling upgrade, will the rbg controller block the entire upgrade — even when maxUnavailable is explicitly set to >1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rolling update process of the rbg controller follows the K8s sts design. If a old version pod enters an unhealthy state, it will not block the upgrade. If a new version pod (only single pod) enters an unhealthy state, it will block the upgrade process.
cheyang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
Ⅰ. Motivation
Ⅱ. Modifications
Add Role coordination KEP
Ⅲ. Does this pull request fix one issue?
NONE
Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅴ. Describe how to verify it
VI. Special notes for reviews
Checklist
make fmt.